Large Language Models are Inconsistent and Biased Evaluators (2405.01724v1)
Abstract: The zero-shot capability of LLMs has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work has mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators in that they: (1) exhibit familiarity bias, a preference for text with lower perplexity; (2) show skewed and biased distributions of ratings; and (3) experience anchoring effects in multi-attribute judgments. We also find that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over state-of-the-art LLM evaluators.
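The familiarity-bias analysis in particular lends itself to a compact illustration: score each candidate summary's perplexity under a language model and check whether the evaluator's ratings track it. Below is a minimal sketch of that kind of probe, assuming GPT-2 (via Hugging Face transformers) as the scoring LM and hypothetical 1-5 evaluator ratings as placeholders; the paper's actual experiments use SummEval summaries and real LLM evaluator scores, not these toy inputs.

```python
# Minimal familiarity-bias probe: does a (hypothetical) evaluator's rating
# correlate with how "familiar" (low-perplexity) the text is to an LM?
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity under the LM: exp of the mean next-token cross-entropy."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level NLL
    return torch.exp(loss).item()

summaries = [
    "The council approved the city's new budget on Tuesday.",
    "Officials said the vote on the budget passed narrowly.",
    "Budget the on city's approved new council the Tuesday.",  # scrambled
    "Narrowly passed vote budget the said on officials the.",  # scrambled
]
llm_ratings = [5, 4, 2, 1]  # hypothetical 1-5 scores, for illustration only

ppls = [perplexity(s) for s in summaries]
rho, _ = spearmanr(ppls, llm_ratings)
# A strongly negative rho (higher perplexity -> lower rating) is the
# signature of familiarity bias the paper reports.
print(f"Spearman rho(perplexity, rating) = {rho:.2f}")
```

A negative rank correlation on its own does not establish bias; the paper's analyses additionally compare LLM scores against human expert ratings on the same summaries to control for genuine quality differences.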
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
- Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online. Association for Computational Linguistics.
- ChatEval: Towards better LLM-based evaluators through multi-agent debate.
- Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study.
- PolyIE: A dataset of information extraction from polymer material scientific literature.
- Cheng-Han Chiang and Hung-yi Lee. 2023. A closer look into using large language models for automatic evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8928–8942, Singapore. Association for Computational Linguistics.
- Nikolas Coupland. 2011. How frequent are numbers? Language & Communication, 31(1):27–37.
- Daniel Deutsch, Rotem Dror, and Dan Roth. 2022. On the limitations of reference-free evaluations of generated text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10960–10977, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
- Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.
- Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. GPTScore: Evaluate as you desire.
- Human-like summarization evaluation with ChatGPT.
- Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization. Association for Computational Linguistics.
- On the round number bias and wisdom of crowds in different response formats for numerical estimation. Scientific Reports, 12(1):8167.
- An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge models are task-specific classifiers.
- Zdeněk Kasner and Ondřej Dušek. 2024. Beyond reference-based metrics: Analyzing behaviors of open LLMs on data-to-text generation.
- Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality.
- Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2023. Benchmarking cognitive biases in large language models as evaluators.
- Leveraging large language models for NLG evaluation: A survey.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models.
- Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment.
- LLMs as narcissistic evaluators: When ego inflates evaluation scores.
- Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir Radev. 2023. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4140–4170, Toronto, Canada. Association for Computational Linguistics.
- Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300.
- ChatGPT as a factual inconsistency evaluator for text summarization.
- Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond.
- Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization.
- Tomoko Nemoto and David Beglar. 2014. Likert-scale questionnaires. In JALT 2013 conference proceedings, pages 1–8.
- Likelihood-based mitigation of evaluation bias in large language models.
- LLM evaluators recognize and favor their own generations.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Are references really needed? unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.
- QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.
- Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
- Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4215–4233, Singapore. Association for Computational Linguistics.
- Characterizing the confidence of large language model-based automatic evaluation metrics. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 76–89, St. Julian’s, Malta. Association for Computational Linguistics.
- Manoj Thomas and Vicki Morwitz. 2009. Heuristics in numerical cognition: Implications for pricing. In Handbook of pricing research in marketing, pages 132–149. Edward Elgar Publishing.
- Amos Tversky and Daniel Kahneman. 1974. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131.
- Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20, Online. Association for Computational Linguistics.
- Is ChatGPT a good NLG evaluator? A preliminary study.
- Large language models are not fair evaluators.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
- Minghao Wu and Alham Fikri Aji. 2023. Style over substance: Evaluation biases for large language models.
- Less is more for long document summary evaluation by LLMs.
- Robert B Zajonc. 1968. Attitudinal effects of mere exposure. Journal of personality and social psychology, 9(2p2):1.
- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT.
- Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
- Large language models are not robust multiple-choice selectors.
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena.
- Hierarchical multi-label classification of online vaccine concerns.
Authors: Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara