
Abstract

LLMs are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at https://github.com/AI4Bharat/FBI.

FBI: a new framework for assessing the robustness of Evaluator LLMs across various tasks and evaluation strategies.

Overview

  • The paper introduces and evaluates the FBI framework, a meta-evaluation tool designed to find blind spots in Evaluator LLMs across various text generation abilities.

  • A dataset of 2400 perturbed answers is used to test the evaluation capabilities of multiple Evaluator LLMs through three paradigms: single-answer scoring, pairwise comparison, and reference-guided evaluation.

  • Results indicate significant deficiencies in the current Evaluator LLMs' ability to detect quality drops, with recommendations for future improvements including deeper task comprehension and advanced meta-evaluation frameworks.

Analyzing the Efficacy of Evaluator LLMs Using the FBI Framework

The paper titled "Finding Blind Spots in Evaluator LLMs with Interpretable Checklists" examines the effectiveness of LLMs as evaluators for text generation tasks. It introduces FBI, a novel meta-evaluation framework designed to evaluate the proficiency of Evaluator LLMs in assessing critical text generation abilities: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. The study is carried out by introducing targeted perturbations into responses generated by LLMs.

Methodology

In this comprehensive study, the authors use a dataset of 2400 perturbed answers, generated by systematically introducing errors into the outputs of an LLM, to test the evaluation capability of various Evaluator LLMs. The perturbations span a wide array of categories, carefully chosen to cover the aforementioned text generation abilities. Each instance of the dataset comprises a prompt, a gold-standard answer, and a perturbed answer. The perturbations are crafted to degrade response quality along exactly one of the defined capabilities, such as factuality, coherence, or reasoning.
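A minimal sketch of how one such instance might be represented is given below. The field names, ability labels, and the entity-swap example are illustrative assumptions, not the paper's exact schema or data.

```python
from dataclasses import dataclass


@dataclass
class PerturbedInstance:
    """One FBI-style test case (field names are illustrative, not the paper's schema)."""
    prompt: str             # the original question or task given to the LLM
    gold_answer: str        # a high-quality answer to the prompt
    perturbed_answer: str   # the gold answer with one targeted error injected
    ability: str            # capability probed: factual accuracy, instruction following,
                            # long-form coherence, or reasoning
    perturbation: str       # one of the 22 perturbation categories


# Hypothetical example of a factuality perturbation that swaps a named entity.
example = PerturbedInstance(
    prompt="Who wrote the novel '1984'?",
    gold_answer="'1984' was written by George Orwell and published in 1949.",
    perturbed_answer="'1984' was written by Aldous Huxley and published in 1949.",
    ability="factual accuracy",
    perturbation="entity swap",
)
```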

Evaluation Paradigms

The study employs three primary evaluation paradigms to benchmark the performance of Evaluator LLMs:

  1. Single-Answer Scoring: In this paradigm, Evaluator LLMs are tasked with scoring a single response based solely on their parametric knowledge. Several strategies are used, including vanilla evaluation, rubric-based evaluation, and axis-based evaluation, in which specific evaluation criteria are highlighted.
  2. Pairwise Comparison: Evaluator LLMs are given two responses -- a gold standard and a perturbed response -- and are required to choose the better response. This paradigm also employs rubrics and specific evaluation axes in some strategies.
  3. Reference-Guided Evaluation: Here, the Evaluator LLMs compare the model response against a reference gold-standard answer, testing whether access to a ground-truth reference improves performance (a rough sketch contrasting all three setups follows this list).
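As a rough illustration of how these three setups differ, the sketch below constructs the prompt an Evaluator LLM would receive under each paradigm. The templates, scoring scales, and function names are assumptions for illustration; the paper's exact prompts, rubrics, and axis definitions are not reproduced here.

```python
def single_answer_prompt(question: str, answer: str) -> str:
    # Single-answer scoring (vanilla): the evaluator sees only the question and one answer.
    return (
        "Rate the following answer on a scale of 1-5.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )


def pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # Pairwise comparison: the evaluator must pick the better of two answers
    # (here, the gold and the perturbed response, presumably in randomized order).
    return (
        "Which answer is better, A or B?\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\nVerdict:"
    )


def reference_guided_prompt(question: str, reference: str, candidate: str) -> str:
    # Reference-guided evaluation: a gold reference is supplied alongside the candidate.
    return (
        "Using the reference answer, rate the candidate answer on a scale of 1-5.\n"
        f"Question: {question}\nReference: {reference}\nCandidate: {candidate}\nScore:"
    )
```

Rubric-based and axis-based variants would extend these templates by appending the scoring rubric or the specific capability axis (e.g., factual accuracy) to the instructions.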

Findings

Results reveal significant shortcomings in current Evaluator LLMs. The models failed to identify quality drops in over 50% of cases on average, even when advanced evaluation strategies with detailed rubrics and specific axes of evaluation were applied. Notably, reference-based evaluations showed comparatively better performance, but still fell short of reliable detection.
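To make the headline number concrete, the sketch below shows one plausible way a missed quality drop could be counted under the scoring and comparison paradigms; the decision rules (e.g., treating ties as misses) are assumptions, not the paper's exact metric.

```python
def missed_in_single_answer(score_gold: float, score_perturbed: float) -> bool:
    # In single-answer scoring, the drop is "detected" only if the perturbed answer
    # receives a strictly lower score than the gold answer; otherwise it is a miss.
    return score_perturbed >= score_gold


def missed_in_pairwise(verdict: str) -> bool:
    # In pairwise comparison, the evaluator should prefer the gold answer
    # (labelled "A" here by convention); any other verdict counts as a miss.
    return verdict.strip().upper() != "A"


def failure_rate(misses: list[bool]) -> float:
    # Fraction of perturbed instances where the evaluator missed the quality drop;
    # the paper reports this exceeds 50% on average for current Evaluator LLMs.
    return sum(misses) / len(misses)
```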

Single-Answer Scoring

The study finds that simple strategies such as vanilla evaluation perform relatively well, and that adding detailed rubrics or explicit evaluation axes did not necessarily improve the models' ability to detect subtler errors. In particular, the Evaluator LLMs struggled significantly with perturbations targeting factual accuracy, coherence, and other fundamental abilities.

Pairwise Comparison

In pairwise comparisons, the performance was marginally better with advanced strategies. However, the models still exhibited a high rate of failures in detecting perturbed responses. Even when presented with both a gold and perturbed response, the Evaluator LLMs often failed to choose the correct answer, indicating unreliability in such comparative setups.

Reference-Guided Evaluation

When a reference gold standard was provided, the Evaluator LLMs showed an improvement in detection of errors, particularly for reasoning tasks. This suggests that having a point of comparison may aid in better evaluation, albeit still with notable limitations.

Comparison with Other Models

The paper extends its evaluation to other popular LLMs, including Claude-3-Opus, Gemini-1.5-Pro, and Llama-3-70B-Instruct. GPT-4-turbo was found to consistently outperform the other models in both the single-answer and reference-less pairwise paradigms. Notably, even purpose-trained evaluator models such as Prometheus-2 were found to be less effective than general-purpose Evaluator LLMs.

Future Directions

The findings underscore the need for improvements in the design and implementation of Evaluator LLMs. There is a clear implication that adopting more sophisticated evaluation strategies alone is insufficient. The study indicates an urgent need for Evaluator LLMs with a deeper understanding of text generation tasks and the nuances of different types of errors. Future developments might include integrating multi-agent meta-evaluation frameworks and extending the FBI checklist to cover advanced capabilities such as multilingual text generation and the use of external tools.

Conclusion

The authors make a compelling case for the cautious implementation of current Evaluator LLMs in practical applications, given their significant blind spots and unreliability. The FBI framework emerges as a robust tool to scrutinize and benchmark the performance of these evaluators, highlighting the critical need for continued refinement in this area of research. The framework not only lays a foundation for more reliable model assessment but also opens avenues for future research to address these documented deficiencies.
