
Abstract

LLMs are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at https://github.com/AI4Bharat/FBI.

FBI: a new framework for assessing the robustness of Evaluator LLMs across various tasks and evaluation strategies.

Overview

  • The paper introduces and evaluates the FBI framework, a meta-evaluation tool designed to find blind spots in Evaluator LLMs across various text generation abilities.

  • A dataset of 2400 perturbed answers is used to test the evaluation capabilities of multiple Evaluator LLMs through three paradigms: single-answer scoring, pairwise comparison, and reference-guided evaluation.

  • Results indicate significant deficiencies in the current Evaluator LLMs' ability to detect quality drops, with recommendations for future improvements including deeper task comprehension and advanced meta-evaluation frameworks.

Analyzing the Efficacy of Evaluator LLMs Using the FBI Framework

The paper titled "Finding Blind Spots in Evaluator LLMs with Interpretable Checklists" examines the effectiveness of LLMs as evaluators for text generation tasks. It introduces FBI, a novel meta-evaluation framework designed to evaluate the proficiency of Evaluator LLMs in assessing critical text generation abilities: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. The study is carried out by introducing targeted perturbations into responses generated by LLMs.

Methodology

In this comprehensive study, the authors use a dataset of 2400 perturbed answers, generated by systematically introducing errors into the outputs of an LLM, to test the evaluation capability of various Evaluator LLMs. The perturbations span a wide array of categories, carefully chosen to cover the aforementioned text generation abilities. Each instance of the dataset comprises a prompt, a gold-standard answer, and a perturbed answer. The perturbations are crafted to degrade response quality along exactly one of the defined capabilities, such as factuality, coherence, or reasoning.
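A minimal sketch of how one such instance might be represented is given below. The field names, ability labels, and the entity-swap example are illustrative assumptions, not the paper's exact schema or data.

```python
from dataclasses import dataclass


@dataclass
class PerturbedInstance:
    """One FBI-style test case (field names are illustrative, not the paper's schema)."""
    prompt: str             # the original question or task given to the LLM
    gold_answer: str        # a high-quality answer to the prompt
    perturbed_answer: str   # the gold answer with one targeted error injected
    ability: str            # capability probed: factual accuracy, instruction following,
                            # long-form coherence, or reasoning
    perturbation: str       # one of the 22 perturbation categories


# Hypothetical example of a factuality perturbation that swaps a named entity.
example = PerturbedInstance(
    prompt="Who wrote the novel '1984'?",
    gold_answer="'1984' was written by George Orwell and published in 1949.",
    perturbed_answer="'1984' was written by Aldous Huxley and published in 1949.",
    ability="factual accuracy",
    perturbation="entity swap",
)
```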

Evaluation Paradigms

The study employs three primary evaluation paradigms to benchmark the performance of Evaluator LLMs:

  1. Single-Answer Scoring: In this paradigm, Evaluator LLMs are tasked with scoring a single response based solely on their parametric knowledge. Several strategies are used, including vanilla evaluation, rubric-based evaluation, and axis-based evaluation, in which specific evaluation criteria are highlighted.
  2. Pairwise Comparison: Evaluator LLMs are given two responses -- a gold standard and a perturbed response -- and are required to choose the better response. This paradigm also employs rubrics and specific evaluation axes in some strategies.
  3. Reference-Guided Evaluation: Here, the Evaluator LLMs compare the model response against a reference gold-standard answer, testing whether access to a ground-truth reference improves performance (a rough sketch contrasting all three setups follows this list).
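As a rough illustration of how these three setups differ, the sketch below constructs the prompt an Evaluator LLM would receive under each paradigm. The templates, scoring scales, and function names are assumptions for illustration; the paper's exact prompts, rubrics, and axis definitions are not reproduced here.

```python
def single_answer_prompt(question: str, answer: str) -> str:
    # Single-answer scoring (vanilla): the evaluator sees only the question and one answer.
    return (
        "Rate the following answer on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )


def pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # Pairwise comparison: the evaluator must pick the better of two answers
    # (here, the gold and the perturbed response, presumably in randomized order).
    return (
        "Which answer is better, A or B?\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\nVerdict:"
    )


def reference_guided_prompt(question: str, reference: str, candidate: str) -> str:
    # Reference-guided evaluation: a gold reference is supplied alongside the candidate.
    return (
        "Using the reference answer, rate the candidate answer on a scale of 1-5.\n"
        f"Question: {question}\nReference: {reference}\nCandidate: {candidate}\nScore:"
    )
```

Rubric-based and axis-based variants would extend these templates by appending the scoring rubric or the specific capability axis (e.g., factual accuracy) to the instructions.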

Findings

Results reveal significant shortcomings in current Evaluator LLMs. The models failed to identify quality drops in over 50% of cases on average, even when advanced evaluation strategies with detailed rubrics and specific axes of evaluation were applied. Notably, reference-based evaluations showed comparatively better performance, but still fell short of reliable detection.
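To make the headline number concrete, the sketch below shows one plausible way a missed quality drop could be counted under the scoring and comparison paradigms; the decision rules (e.g., treating ties as misses) are assumptions, not the paper's exact metric.

```python
def missed_in_single_answer(score_gold: float, score_perturbed: float) -> bool:
    # In single-answer scoring, the drop is "detected" only if the perturbed answer
    # receives a strictly lower score than the gold answer; otherwise it is a miss.
    return score_perturbed >= score_gold


def missed_in_pairwise(verdict: str) -> bool:
    # In pairwise comparison, the evaluator should prefer the gold answer
    # (labelled "A" here by convention); any other verdict counts as a miss.
    return verdict.strip().upper() != "A"


def failure_rate(misses: list[bool]) -> float:
    # Fraction of perturbed instances where the evaluator missed the quality drop;
    # the paper reports this exceeds 50% on average for current Evaluator LLMs.
    return sum(misses) / len(misses)
```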

Single-Answer Scoring

The study finds that simple strategies such as vanilla evaluation perform relatively well, and that adding detailed rubrics or explicit evaluation axes did not necessarily improve the models' ability to detect subtler errors. In particular, the Evaluator LLMs struggled significantly with perturbations targeting factual accuracy, coherence, and other fundamental abilities.

Pairwise Comparison

In pairwise comparisons, the performance was marginally better with advanced strategies. However, the models still exhibited a high rate of failures in detecting perturbed responses. Even when presented with both a gold and perturbed response, the Evaluator LLMs often failed to choose the correct answer, indicating unreliability in such comparative setups.

Reference-Guided Evaluation

When a reference gold standard was provided, the Evaluator LLMs showed an improvement in detection of errors, particularly for reasoning tasks. This suggests that having a point of comparison may aid in better evaluation, albeit still with notable limitations.

Comparison with Other Models

The paper extends its evaluation to other popular LLMs, including Claude-3-Opus, Gemini-1.5-Pro, and Llama-3-70B-Instruct. GPT-4-turbo was found to consistently outperform the other models in both the single-answer and reference-less pairwise paradigms. Notably, even purpose-trained evaluator models such as Prometheus-2 were found to be less effective than general-purpose Evaluator LLMs.

Future Directions

The findings underscore the need for improvements in the design and implementation of Evaluator LLMs. There is a clear implication that adopting more sophisticated evaluation strategies alone is insufficient. The study indicates an urgent need for Evaluator LLMs with a deeper understanding of text generation tasks and the nuances of different types of errors. Future developments might include integrating multi-agent meta-evaluation frameworks and extending the FBI checklist to cover advanced capabilities such as multilingual text generation and the use of external tools.

Conclusion

The authors make a compelling case for the cautious implementation of current Evaluator LLMs in practical applications, given their significant blind spots and unreliability. The FBI framework emerges as a robust tool to scrutinize and benchmark the performance of these evaluators, highlighting the critical need for continued refinement in this area of research. The framework not only lays a foundation for more reliable model assessment but also opens avenues for future research to address these documented deficiencies.
