- The paper’s main finding is that LLM-based evaluators are not yet at human level for abstractive summarization, and they struggle most when distinguishing subtle quality differences between closely matched systems.
- It introduces evaluation methods like RTS and MCQ that reveal candidate- and dimension-specific biases affecting the reliability of automatic assessments.
- The study demonstrates that even advanced models like GPT-4 correlate better with human judgments but still lack the stability and consistency of human evaluation.
LLMs as Evaluators for Abstractive Summarization
Recent developments in LLMs such as ChatGPT and GPT-4 have sparked interest in their potential use as evaluators for abstractive summarization. These models offer a cost-effective and rapid alternative to human evaluation, which is traditionally required to complement automatic metrics. This essay critically examines the paper "LLMs are Not Yet Human-Level Evaluators for Abstractive Summarization" (arXiv:2305.13091), which analyzes the stability and reliability of LLMs as automatic evaluators.
Evaluation Capabilities of LLMs
The paper investigates the evaluation performance of LLMs across different dimensions of summarization, such as coherence, consistency, fluency, and relevance. It introduces two methods for evaluation: Reason-then-Score (RTS) and Multiple-Choice Question (MCQ), alongside Head-to-Head comparisons (H2H). These are employed to approximate typical human evaluation techniques like Likert-scale scoring and head-to-head comparative assessment.
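To make the RTS protocol concrete, below is a minimal sketch of how a Reason-then-Score query might be issued, assuming the `openai` Python client; the prompt wording, model name, and score-parsing logic are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a Reason-then-Score (RTS) query: the model explains its reasoning
# first and only then emits a 1-5 score for a single evaluation dimension.
# The prompt template below is illustrative, not the paper's exact wording.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RTS_PROMPT = (
    "You will evaluate a summary of a news article on one dimension.\n"
    "Dimension: {dimension}\n\n"
    "Article:\n{article}\n\n"
    "Summary:\n{summary}\n\n"
    "First explain your reasoning step by step, then finish with a line of the "
    "form 'Score: X', where X is an integer from 1 (worst) to 5 (best)."
)

def rts_score(article: str, summary: str, dimension: str, model: str = "gpt-4") -> int:
    """Return a 1-5 score for one dimension using a Reason-then-Score prompt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": RTS_PROMPT.format(dimension=dimension, article=article, summary=summary),
        }],
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", text)
    if match is None:
        raise ValueError(f"Could not parse a score from: {text!r}")
    return int(match.group(1))
```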
Correct Preferences
The goal of evaluation is to emulate human judgment in distinguishing between competing summarization systems. When candidate pairs are closely matched, established metrics such as BERTScore and BARTScore often fail to identify the system that humans prefer. ChatGPT with RTS improves on these metrics but still struggles to differentiate candidates separated by small performance gaps.
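As a worked illustration of what "correct preferences" means here, the sketch below counts how often a metric prefers the same system as humans across all pairs of candidate systems; the system-level scores are hypothetical placeholders.

```python
# Fraction of candidate-system pairs on which a metric agrees with the human
# preference. The system-level scores below are hypothetical placeholders.
from itertools import combinations

human_scores = {"sysA": 4.1, "sysB": 3.9, "sysC": 3.2}      # mean human ratings
metric_scores = {"sysA": 0.71, "sysB": 0.74, "sysC": 0.60}  # mean metric scores

def correct_preference_rate(human: dict, metric: dict) -> float:
    pairs = list(combinations(human, 2))
    agree = sum(
        (human[a] > human[b]) == (metric[a] > metric[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# sysA vs sysB is a closely matched pair where the metric flips the preference.
print(correct_preference_rate(human_scores, metric_scores))  # 2 of 3 pairs agree -> 0.666...
```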
Correlations with Human Judgments
Compared to traditional metrics such as ROUGE and even neural metrics like BARTScore, LLM-based evaluators consistently showed higher correlation with human evaluations. ChatGPT-RTS in particular outperformed existing automatic metrics in aligning with human scores, but it became inconsistent when evaluating high-performing models.
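For reference, such correlations are typically measured with rank statistics like Spearman's rho or Kendall's tau over summary-level scores; a minimal sketch using `scipy` follows, with made-up ratings standing in for real annotations.

```python
# Summary-level correlation between an automatic evaluator and human ratings.
# The score lists are made-up placeholders for real annotations.
from scipy.stats import kendalltau, spearmanr

human   = [4, 5, 3, 2, 4, 1, 5, 3]   # human Likert ratings per summary
llm_rts = [4, 4, 3, 2, 5, 2, 5, 3]   # LLM RTS scores for the same summaries

rho, _ = spearmanr(human, llm_rts)
tau, _ = kendalltau(human, llm_rts)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```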
Stability and Reliability Analysis
LLM evaluators must be stable across evaluated systems to be considered reliable. The paper explored this stability through per-candidate correlations and a novel meta-correlation metric.
Per-Candidate Correlations
The correlation analysis between LLM scores and human scores for individual candidate systems revealed significant variability, indicating that LLMs may not apply a consistent evaluation standard across systems. This candidate dependency is a key limitation of relying solely on LLM evaluations.
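One way to reproduce this kind of analysis is to group summaries by the system that produced them and compute the LLM-human correlation separately within each group; the sketch below uses `pandas` and synthetic scores.

```python
# Per-candidate correlation: compute the LLM-human correlation separately for
# each candidate summarization system and inspect the spread. Data are synthetic.
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame({
    "system":    ["sysA"] * 4 + ["sysB"] * 4,
    "human":     [4, 5, 3, 4, 2, 3, 3, 1],
    "llm_score": [4, 4, 3, 5, 3, 2, 4, 2],
})

per_candidate = (
    df.groupby("system")[["human", "llm_score"]]
      .apply(lambda g: spearmanr(g["human"], g["llm_score"])[0])
)
print(per_candidate)        # one correlation per candidate system
print(per_candidate.std())  # a large spread indicates a candidate-dependent evaluator
```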
Summary Quality vs Human Alignment
The paper's meta-correlation metric highlights a troubling trend: as the quality of summaries improves, LLM evaluators show decreasing correlation with human judgments. This negative meta-correlation for dimensions such as consistency and fluency indicates that LLMs become less reliable with higher quality summaries, which could lead to misleading evaluations as summarization systems evolve.
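The meta-correlation idea can be sketched as follows: for each candidate system, pair its mean human quality with the within-system LLM-human correlation, then correlate those two series across systems; a negative value means the evaluator tracks humans less well as summaries improve. The code below is an illustrative reconstruction with synthetic numbers, not the paper's exact computation.

```python
# Illustrative meta-correlation: correlate each system's average human quality
# with the within-system LLM-human correlation. A negative result means the
# LLM evaluator aligns less with humans as summary quality rises.
# All numbers are synthetic.
from scipy.stats import pearsonr, spearmanr

systems = {  # per-summary ratings, grouped by candidate system
    "sysA": {"human": [5, 4, 5, 4], "llm": [4, 4, 5, 3]},
    "sysB": {"human": [3, 4, 3, 3], "llm": [3, 4, 2, 3]},
    "sysC": {"human": [2, 1, 2, 3], "llm": [2, 1, 3, 3]},
}

mean_quality, within_corr = [], []
for scores in systems.values():
    mean_quality.append(sum(scores["human"]) / len(scores["human"]))
    within_corr.append(spearmanr(scores["human"], scores["llm"])[0])

meta_corr, _ = pearsonr(mean_quality, within_corr)
print(f"meta-correlation = {meta_corr:.3f}")  # negative here: worse alignment on better systems
```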
Evaluation with GPT-4
To assess whether stronger LLMs resolve these issues, the paper extends its evaluation to GPT-4. While GPT-4 outperforms ChatGPT on several correlation metrics, it still suffers from candidate and dimension dependency. Furthermore, it has greater difficulty maintaining consistent evaluation quality, particularly on the relevance dimension, illustrating how hard it is to balance informativeness with consistency.
A Temporary Framework for Practitioners
Recognizing the potential of LLMs despite their limitations, the paper recommends using a combined evaluation framework. By calculating the correlation between RTS and MCQ scores, practitioners may gauge the reliability of LLM evaluations. This framework provides an initial indicator of where further human evaluation might be necessary.
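A rough sketch of that check follows, assuming RTS and MCQ scores are available for the same set of summaries and mapped to a common 1-5 scale; when the two prompting modes disagree (low correlation), the LLM's judgments are flagged for human review. The scores and threshold below are arbitrary illustrations, not values from the paper.

```python
# Reliability check from the combined framework: correlate RTS and MCQ scores
# for the same summaries; low agreement flags the setting for human review.
# Scores and the threshold are illustrative placeholders.
from scipy.stats import spearmanr

rts_scores = [4, 5, 3, 2, 4, 3, 5, 2]   # placeholder RTS scores
mcq_scores = [4, 4, 3, 3, 5, 3, 5, 1]   # placeholder MCQ choices mapped to 1-5

AGREEMENT_THRESHOLD = 0.7  # arbitrary cut-off for illustration

rho, _ = spearmanr(rts_scores, mcq_scores)
if rho < AGREEMENT_THRESHOLD:
    print(f"RTS/MCQ agreement rho={rho:.2f}: low -> defer to human evaluation")
else:
    print(f"RTS/MCQ agreement rho={rho:.2f}: LLM scores likely usable")
```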
Conclusion
The analysis in the paper indicates that while LLMs demonstrate impressive potential as evaluators for abstractive summarization, they are not yet ready to replace human evaluators entirely. They improve over traditional metrics but still exhibit system- and dimension-specific biases. There is a clear need for better automatic metrics; in the interim, researchers should employ a combined evaluation approach to determine when LLMs are reliable. Future research should focus on improving the stability and dimension-agnostic behavior of LLM evaluations to broaden their applicability in real-world tasks.