- The paper’s main finding is that LLM-based evaluators are not yet at human level for abstractive summarization, and they struggle most when distinguishing subtle quality differences between closely matched systems.
- It introduces evaluation methods like RTS and MCQ that reveal candidate- and dimension-specific biases affecting the reliability of automatic assessments.
- The study demonstrates that even advanced models like GPT-4 correlate better with human judgments but still lack the stability and consistency of human evaluation.
LLMs as Evaluators for Abstractive Summarization
Recent developments in LLMs such as ChatGPT and GPT-4 have sparked interest in their potential use as evaluators for abstractive summarization. These models offer a cost-effective and rapid alternative to human evaluation, which is traditionally required to complement automatic metrics. This essay critically examines the paper "LLMs are Not Yet Human-Level Evaluators for Abstractive Summarization" (arXiv:2305.13091), which analyzes the stability and reliability of LLMs as automatic evaluators.
Evaluation Capabilities of LLMs
The paper investigates the evaluation performance of LLMs across different dimensions of summarization, such as coherence, consistency, fluency, and relevance. It introduces two methods for evaluation: Reason-then-Score (RTS) and Multiple-Choice Question (MCQ), alongside Head-to-Head comparisons (H2H). These are employed to approximate typical human evaluation techniques like Likert-scale scoring and head-to-head comparative assessment.
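To make the RTS protocol concrete, below is a minimal sketch of how a Reason-then-Score query might be issued, assuming the `openai` Python client; the prompt wording, model name, and score-parsing logic are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a Reason-then-Score (RTS) query: the model explains its reasoning
# first and only then emits a 1-5 score for a single evaluation dimension.
# The prompt template below is illustrative, not the paper's exact wording.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RTS_PROMPT = (
    "You will evaluate a summary of a news article on one dimension.\n"
    "Dimension: {dimension}\n\n"
    "Article:\n{article}\n\n"
    "Summary:\n{summary}\n\n"
    "First explain your reasoning step by step, then finish with a line of the "
    "form 'Score: X', where X is an integer from 1 (worst) to 5 (best)."
)

def rts_score(article: str, summary: str, dimension: str, model: str = "gpt-4") -> int:
    """Return a 1-5 score for one dimension using a Reason-then-Score prompt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": RTS_PROMPT.format(dimension=dimension, article=article, summary=summary),
        }],
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    if match is None:
        raise ValueError(f"Could not parse a score from: {text!r}")
    return int(match.group(1))
```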
Correct Preferences
The goal of evaluation is to emulate human judgment in distinguishing between competing summarization systems. When candidate pairs are closely matched, established metrics such as BERTScore and BARTScore often fail to identify the system that humans prefer. ChatGPT with RTS improves on these metrics but still struggles to differentiate candidates separated by small performance gaps.
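As a worked illustration of what "correct preferences" means here, the sketch below counts how often a metric prefers the same system as humans across all pairs of candidate systems; the system-level scores are hypothetical placeholders.

```python
# Fraction of candidate-system pairs on which a metric agrees with the human
# preference. The system-level scores below are hypothetical placeholders.
from itertools import combinations

human_scores = {"sysA": 4.1, "sysB": 3.9, "sysC": 3.2}      # mean human ratings
metric_scores = {"sysA": 0.71, "sysB": 0.74, "sysC": 0.60}  # mean metric scores

def correct_preference_rate(human: dict, metric: dict) -> float:
    pairs = list(combinations(human, 2))
    agree = sum(
        (human[a] > human[b]) == (metric[a] > metric[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# sysA vs sysB is a closely matched pair where the metric flips the preference.
print(correct_preference_rate(human_scores, metric_scores))  # 2 of 3 pairs agree -> 0.666...
```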
Correlations with Human Judgments
Compared to traditional metrics such as ROUGE and even neural metrics like BARTScore, LLM-based evaluators consistently showed higher correlation with human evaluations. ChatGPT-RTS in particular outperformed existing automatic metrics in aligning with human scores, but it became inconsistent when evaluating high-performing models.
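For reference, such correlations are typically measured with rank statistics like Spearman's rho or Kendall's tau over summary-level scores; a minimal sketch using `scipy` follows, with made-up ratings standing in for real annotations.

```python
# Summary-level correlation between an automatic evaluator and human ratings.
# The score lists are made-up placeholders for real annotations.
from scipy.stats import kendalltau, spearmanr

human   = [4, 5, 3, 2, 4, 1, 5, 3]   # human Likert ratings per summary
llm_rts = [4, 4, 3, 2, 5, 2, 5, 3]   # LLM RTS scores for the same summaries

rho, _ = spearmanr(human, llm_rts)
tau, _ = kendalltau(human, llm_rts)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```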
Stability and Reliability Analysis
LLM evaluators must be stable across evaluated systems to be considered reliable. The paper explored this stability through per-candidate correlations and a novel meta-correlation metric.
Per-Candidate Correlations
The correlation analysis between LLM scores and human scores for individual candidate systems revealed significant variability, indicating that LLMs may not apply a consistent evaluation standard across systems. This candidate dependency is a key limitation of relying solely on LLM evaluations.
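One way to reproduce this kind of analysis is to group summaries by the system that produced them and compute the LLM-human correlation separately within each group; the sketch below uses `pandas` and synthetic scores.

```python
# Per-candidate correlation: compute the LLM-human correlation separately for
# each candidate summarization system and inspect the spread. Data are synthetic.
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame({
    "system":    ["sysA"] * 4 + ["sysB"] * 4,
    "human":     [4, 5, 3, 4, 2, 3, 3, 1],
    "llm_score": [4, 4, 3, 5, 3, 2, 4, 2],
})

per_candidate = (
    df.groupby("system")[["human", "llm_score"]]
      .apply(lambda g: spearmanr(g["human"], g["llm_score"])[0])
)
print(per_candidate)        # one correlation per candidate system
print(per_candidate.std())  # a large spread indicates a candidate-dependent evaluator
```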
Summary Quality vs Human Alignment
The paper's meta-correlation metric highlights a troubling trend: as the quality of summaries improves, LLM evaluators show decreasing correlation with human judgments. This negative meta-correlation for dimensions such as consistency and fluency indicates that LLMs become less reliable with higher quality summaries, which could lead to misleading evaluations as summarization systems evolve.
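The meta-correlation idea can be sketched as follows: for each candidate system, pair its mean human quality with the within-system LLM-human correlation, then correlate those two series across systems; a negative value means the evaluator tracks humans less well as summaries improve. The code below is an illustrative reconstruction with synthetic numbers, not the paper's exact computation.

```python
# Illustrative meta-correlation: correlate each system's average human quality
# with the within-system LLM-human correlation. A negative result means the
# LLM evaluator aligns less with humans as summary quality rises.
# All numbers are synthetic.
from scipy.stats import pearsonr, spearmanr

systems = {  # per-summary ratings, grouped by candidate system
    "sysA": {"human": [5, 4, 5, 4], "llm": [4, 4, 5, 3]},
    "sysB": {"human": [3, 4, 3, 3], "llm": [3, 4, 2, 3]},
    "sysC": {"human": [2, 1, 2, 3], "llm": [2, 1, 3, 3]},
}

mean_quality, within_corr = [], []
for scores in systems.values():
    mean_quality.append(sum(scores["human"]) / len(scores["human"]))
    within_corr.append(spearmanr(scores["human"], scores["llm"])[0])

meta_corr, _ = pearsonr(mean_quality, within_corr)
print(f"meta-correlation = {meta_corr:.3f}")  # negative here: worse alignment on better systems
```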
Evaluation with GPT-4
To assess whether stronger LLMs resolve these issues, the paper extends its evaluation to GPT-4. While GPT-4 outperforms ChatGPT on several correlation metrics, it still suffers from candidate and dimension dependency. Furthermore, it has greater difficulty maintaining consistent evaluation quality, particularly on the relevance dimension, illustrating how hard it is to balance informativeness with consistency.
A Temporary Framework for Practitioners
Recognizing the potential of LLMs despite their limitations, the paper recommends using a combined evaluation framework. By calculating the correlation between RTS and MCQ scores, practitioners may gauge the reliability of LLM evaluations. This framework provides an initial indicator of where further human evaluation might be necessary.
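A rough sketch of that check follows, assuming RTS and MCQ scores are available for the same set of summaries and mapped to a common 1-5 scale; when the two prompting modes disagree (low correlation), the LLM's judgments are flagged for human review. The scores and threshold below are arbitrary illustrations, not values from the paper.

```python
# Reliability check from the combined framework: correlate RTS and MCQ scores
# for the same summaries; low agreement flags the setting for human review.
# Scores and the threshold are illustrative placeholders.
from scipy.stats import spearmanr

rts_scores = [4, 5, 3, 2, 4, 3, 5, 2]   # placeholder RTS scores
mcq_scores = [4, 4, 3, 3, 5, 3, 5, 1]   # placeholder MCQ choices mapped to 1-5

AGREEMENT_THRESHOLD = 0.7  # arbitrary cut-off for illustration

rho, _ = spearmanr(rts_scores, mcq_scores)
if rho < AGREEMENT_THRESHOLD:
    print(f"RTS/MCQ agreement rho={rho:.2f}: low -> defer to human evaluation")
else:
    print(f"RTS/MCQ agreement rho={rho:.2f}: LLM scores likely usable")
```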
Conclusion
The analysis in the paper indicates that while LLMs demonstrate impressive potential as evaluators for abstractive summarization, they are not yet ready to replace human evaluators entirely. They improve over traditional metrics but still exhibit system- and dimension-specific biases. There is a clear need for better automatic metrics; in the interim, researchers should employ a combined evaluation approach to determine when LLMs are reliable. Future research should focus on improving the stability and dimension-agnostic behavior of LLM evaluations to broaden their applicability in real-world tasks.