Large Language Models are Inconsistent and Biased Evaluators

(arXiv:2405.01724)
Published May 2, 2024 in cs.CL and cs.AI

Abstract

The zero-shot capability of LLMs has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias, a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.

Figure: Performance comparison on the RoSE benchmark, showing the proposed method outperforming the SOTA LLM evaluator for summarization.

Overview

  • The paper discusses the use of LLMs as automatic evaluators for tasks like text summarization, assessing their effectiveness and highlighting significant issues of bias and inconsistency.

  • It identifies specific biases present in LLM evaluators, such as familiarity bias, scoring granularity and score biases, and anchoring effects, and explains how these biases can impair the fairness and reliability of evaluations.

  • The paper also examines the consistency of LLM evaluators, compares their performance with that of human evaluators, and suggests future directions to enhance their reliability and mitigate bias.

Exploring the Effectiveness and Limitations of LLM Evaluators in Summarization Tasks

Introduction to LLM Evaluators

Automatic evaluation has become a staple in the field of NLP, particularly for tasks like text summarization and machine translation. Traditionally, evaluation metrics such as ROUGE and BLEU have been utilized, which rely heavily on comparing generated text against a set of reference texts. However, these methods have their limitations, primarily when scaling beyond benchmark datasets or when references are not available.

To overcome these challenges, recent developments have turned to LLMs as potential automatic evaluators, which we refer to as LLM evaluators. These models do not require reference texts; they judge the quality of generated content directly. Despite their growing popularity, critical assessments of their robustness, specifically their bias and consistency, remain relatively unexplored, a gap this paper addresses.

Unpacking Bias: Types and Implications

The presence of bias in LLM evaluators can significantly affect their reliability and fairness. Here’s what the research uncovered:

Familiarity Bias:

  • LLM evaluators show a preference for texts that have lower perplexity, indicating a bias toward texts that are more familiar or simpler to them. This bias suggests that evaluators might not judge the content's quality fairly but rather its familiarity to the model's training data.
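To make the perplexity check concrete, here is a minimal sketch of how such a probe could look, assuming Hugging Face `transformers` with GPT-2 as a stand-in perplexity model; the `summaries` and `llm_scores` values are hypothetical placeholders, and the paper's actual setup may differ.

```python
# Sketch: probe for familiarity bias by checking whether the LLM's quality
# scores track (inverse) perplexity rather than quality per se.
# Assumes Hugging Face `transformers` with GPT-2 as a proxy perplexity model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from scipy.stats import spearmanr

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the proxy language model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Hypothetical data: candidate summaries and the quality ratings an LLM
# evaluator assigned to them.
summaries = ["Summary A ...", "Summary B ...", "Summary C ..."]
llm_scores = [4, 5, 2]

ppls = [perplexity(s) for s in summaries]
rho, p = spearmanr(ppls, llm_scores)
# A strongly negative rho (lower perplexity -> higher score) is consistent
# with familiarity bias.
print(f"Spearman rho(perplexity, LLM score) = {rho:.3f} (p = {p:.3f})")
```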

Scoring Granularity and Score Biases:

  • The research tested various scoring granularities, from simple 1-5 scales to more detailed 1-100 scales. It found evidence of round number bias, where evaluators disproportionately favor scores like 90 or 95, and significant parts of the scoring range (e.g., 1-60) are underutilized.
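One simple way to surface round-number bias and range underuse is to histogram the scores an evaluator emits on a 1-100 scale. The sketch below uses hypothetical scores purely to illustrate the check.

```python
# Sketch: inspect a 1-100 score distribution for round-number bias and
# underused score ranges. `scores` is a hypothetical sample of evaluator output.
from collections import Counter

scores = [90, 95, 90, 85, 95, 100, 90, 80, 95, 90, 70, 95]

counts = Counter(scores)
total = len(scores)

round_share = sum(c for s, c in counts.items() if s % 5 == 0) / total
low_range_share = sum(c for s, c in counts.items() if s <= 60) / total

print(f"Share of scores that are multiples of 5: {round_share:.0%}")
print(f"Share of scores in the 1-60 range:       {low_range_share:.0%}")
# If scores were spread smoothly over the scale, multiples of 5 would account
# for roughly 20% of ratings; values far above that suggest round-number bias,
# and a near-zero low-range share indicates the bottom of the scale is
# underutilized.
```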

Anchoring Effects:

  • When LLM evaluators generate evaluations for multiple attributes in one go, the score for an attribute can be unduly influenced by the scores assigned to previous attributes. This anchoring effect shows a strong bias that can skew the evaluators' judgments, particularly in complex multi-attribute evaluations.
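A rough way to quantify such anchoring is to compare how strongly later attribute scores track the first-rated attribute when all attributes are scored in one prompt versus one prompt per attribute. The sketch below assumes hypothetical, already-parsed score data and SummEval-style attributes; it illustrates the comparison, not the paper's exact protocol.

```python
# Sketch: compare anchoring between two prompting setups: all attributes rated
# in one prompt ("joint") vs. one attribute per prompt ("separate").
# The score dictionaries below are hypothetical stand-ins for parsed LLM output.
from scipy.stats import spearmanr

ATTRIBUTES = ["coherence", "consistency", "fluency", "relevance"]

# One dict per evaluated summary: {attribute: score on a 1-5 scale}.
joint_scores = [
    {"coherence": 4, "consistency": 4, "fluency": 4, "relevance": 4},
    {"coherence": 2, "consistency": 2, "fluency": 3, "relevance": 2},
    {"coherence": 5, "consistency": 5, "fluency": 5, "relevance": 5},
    {"coherence": 3, "consistency": 3, "fluency": 3, "relevance": 4},
]
separate_scores = [
    {"coherence": 4, "consistency": 3, "fluency": 5, "relevance": 2},
    {"coherence": 2, "consistency": 4, "fluency": 3, "relevance": 5},
    {"coherence": 5, "consistency": 2, "fluency": 4, "relevance": 3},
    {"coherence": 3, "consistency": 5, "fluency": 2, "relevance": 4},
]

def anchor_correlations(results: list[dict]) -> dict:
    """Spearman correlation of each later attribute with the first-rated one."""
    first = [r[ATTRIBUTES[0]] for r in results]
    out = {}
    for attr in ATTRIBUTES[1:]:
        rho, _ = spearmanr(first, [r[attr] for r in results])
        out[attr] = round(rho, 3)
    return out

# Markedly higher correlations in the joint setting than in the separate
# setting are consistent with later attributes being anchored on the first.
print("joint:   ", anchor_correlations(joint_scores))
print("separate:", anchor_correlations(separate_scores))
```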

Consistency Concerns

Consistency in evaluation is crucial for trust and reliability in automated systems. The paper reveals that:

  • LLM evaluators exhibit a notable degree of inconsistency, not only between different samples but also based on slight changes in how tasks are presented or prompted.
  • Krippendorff’s alpha for inter-sample agreement was notably lower for LLMs than for human evaluators, highlighting a significant gap in reliability (see the sketch below).
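As an illustration of this measurement, the sketch below treats repeated sampling runs of the same LLM evaluator as independent raters and computes Krippendorff's alpha over their ratings. It assumes the `krippendorff` Python package and uses hypothetical rating data.

```python
# Sketch: estimate "inter-sample" agreement by treating repeated sampling runs
# of the same LLM evaluator as raters and computing Krippendorff's alpha.
# Rows = runs, columns = evaluated summaries, np.nan = missing rating.
import numpy as np
import krippendorff  # assumed: the `krippendorff` package from PyPI

ratings_by_run = np.array([
    [4, 5, 2, 3, 4, np.nan],
    [5, 5, 3, 2, 4, 4],
    [3, 4, 2, 3, 5, 4],
])

alpha = krippendorff.alpha(
    reliability_data=ratings_by_run,
    level_of_measurement="ordinal",
)
# Values near 1 indicate consistent ratings across runs; the paper reports
# that LLM evaluators fall well below human inter-annotator agreement.
print(f"Krippendorff's alpha (ordinal): {alpha:.3f}")
```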

Practical Implications and Future Perspectives

The findings indicate that while LLM evaluators hold promise for automating text evaluation tasks, there are critical areas for improvement, primarily around bias and inconsistency. The paper shares concrete recipes for mitigating these issues, such as adjusting scoring scales and prompt structures.
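As a purely illustrative example of such a configuration, the sketch below combines two mitigations in the spirit of those discussed above: a coarse 1-5 scale rather than a fine-grained one, and one attribute per prompt so earlier ratings cannot anchor later ones. The template wording and helper names are assumptions, not the paper's exact recipe.

```python
# Sketch: an illustrative evaluator configuration applying two mitigations:
# a coarse 1-5 scale and one attribute per prompt. Template text is assumed,
# not taken from the paper.

PROMPT_TEMPLATE = """You are grading a summary of the source document below.

Source document:
{source}

Summary:
{summary}

Rate the summary's {attribute} on a scale from 1 (worst) to 5 (best).
Respond with a single integer and nothing else."""

def build_prompts(source: str, summary: str, attributes: list[str]) -> list[str]:
    """One prompt per attribute, so earlier ratings cannot anchor later ones."""
    return [
        PROMPT_TEMPLATE.format(source=source, summary=summary, attribute=attr)
        for attr in attributes
    ]

prompts = build_prompts("<source text>", "<candidate summary>",
                        ["coherence", "consistency", "fluency", "relevance"])
```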

Future Directions:

  • Expanding these evaluations to include other LLMs could provide broader insights and help generalize these findings.
  • Deeper exploration into alternative solutions for identified issues, potentially through novel machine learning techniques or more refined prompt engineering, could further enhance the reliability and fairness of LLM evaluators.

Conclusion

This research provides a foundational look at the potential and pitfalls of using LLMs as automatic evaluators in NLP. By highlighting the types of biases and consistency issues these models may have, it lays the groundwork for future investigations and developments in creating more robust and fair automated evaluation systems.
