
Abstract

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.

Figure: Each instance features a fine-grained evaluation criterion for precise performance assessment.

Overview

  • The BiGGen Bench introduces a comprehensive and systematic benchmarking methodology for fine-grained evaluation of 103 frontier language models (LMs), focusing on nine core capabilities across 77 tasks and 765 instances.

  • The benchmark employs five evaluator LMs, including proprietary models like GPT-4, and examines whether open-source evaluator LMs, improved through continual feedback training and self-consistency decoding, can serve as reliable alternatives.

  • Key findings emphasize predictable performance improvements with model scaling, significant gaps in capabilities like reasoning and tool usage, and the promising potential of open-source evaluator LMs for cost-effective and transparent evaluations.

The BiGGen Bench: An In-depth Analysis of a Principled Benchmark for Fine-grained Evaluation of Language Models

The rapid advancement of language models (LMs) has significantly expanded their potential applications, resulting in a corresponding need for robust evaluation mechanisms that can rigorously assess their varied capabilities. The BiGGen Bench emerges as a notable contribution in this area, providing a detailed and systematic benchmarking methodology that evaluates LMs across a broad spectrum of functionalities with fine-grained criteria. This paper introduces the BiGGen Bench and presents analysis and results from the evaluation of 103 frontier language models.

Core Tenets of BiGGen Bench

Fine-Grained, Instance-Specific Evaluation: BiGGen Bench is distinctive in its use of instance-specific evaluation criteria, moving beyond traditional broad measures of "helpfulness" and "harmlessness" often found in existing benchmarks. Instead, it focuses on nine core capabilities: instruction following, grounding, planning, reasoning, refinement, safety, theory of mind, tool usage, and multilingualism, evaluated across 77 tasks and 765 instances. This approach aligns more closely with the nuanced discernment typical of human evaluators, allowing for a deeper and more precise assessment of each LM's strengths and weaknesses.
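
To make the idea of instance-specific criteria concrete, a single benchmark instance might look roughly like the sketch below. The field names, the task name, and the rubric text are illustrative assumptions rather than the dataset's actual schema.

```python
# A minimal sketch of what one instance with an instance-specific rubric might
# look like. Field names and contents are illustrative assumptions, not the
# dataset's actual schema.
example_instance = {
    "capability": "reasoning",       # one of the nine core capabilities
    "task": "hypothesis_proposal",   # hypothetical task name within that capability
    "input": (
        "A bakery's sales drop every third week of the month. "
        "Propose two plausible hypotheses that could explain this pattern."
    ),
    "reference_answer": (
        "Possible hypotheses: (1) payday cycles reduce discretionary spending in "
        "the third week; (2) a competitor runs monthly promotions in that week."
    ),
    # An instance-specific 5-point rubric, rather than a generic criterion
    # such as overall 'helpfulness'.
    "score_rubric": {
        "criteria": (
            "Does the response propose distinct, plausible hypotheses grounded "
            "in the described sales pattern?"
        ),
        "score1": "Proposes no hypothesis or an irrelevant one.",
        "score2": "Proposes a single vague hypothesis with no link to the pattern.",
        "score3": "Proposes two hypotheses, but one is implausible or generic.",
        "score4": "Proposes two plausible hypotheses with minor gaps in justification.",
        "score5": "Proposes two distinct, plausible, well-justified hypotheses.",
    },
}
```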

Evaluator Language Models: The benchmark leverages evaluator LMs to conduct the assessments, addressing the challenge of subjective evaluation by standardizing the scoring criteria. Five evaluator LMs, including proprietary models like GPT-4, are used to ensure reliable and consistent grading. Additionally, the work explores the reliability of open-source evaluator LMs enhanced through techniques such as continual feedback training and self-consistency decoding.

Evaluation Protocol and Construction

The BiGGen Bench employs a comprehensive evaluation protocol in which each LM response is assessed against a detailed rubric defined for each instance. These rubrics and tasks are designed through a human-in-the-loop process, ensuring that each instance is relevant to, and appropriately difficult for, the capability being tested. A rigorous cross-validation step further safeguards the quality of the instances.
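
As a rough illustration of this rubric-based protocol, the sketch below assembles a grading prompt from an instance (using the dictionary layout sketched earlier) and parses a 1-5 score from the evaluator's reply. The prompt wording, the '[RESULT] <score>' output convention, and the helper names are assumptions for illustration, not the benchmark's verbatim evaluation prompt.

```python
import re
from typing import Optional

def build_judge_prompt(instance: dict, response: str) -> str:
    """Assemble a grading prompt from an instance's rubric and a model response.

    The template and the '[RESULT] <score>' convention are illustrative
    assumptions, not the benchmark's exact prompt.
    """
    rubric = instance["score_rubric"]
    levels = "\n".join(f"Score {i}: {rubric[f'score{i}']}" for i in range(1, 6))
    return (
        "You are a strict evaluator. Grade the response on a 1-5 scale.\n\n"
        f"### Instruction:\n{instance['input']}\n\n"
        f"### Response to evaluate:\n{response}\n\n"
        f"### Reference answer:\n{instance['reference_answer']}\n\n"
        f"### Scoring criteria:\n{rubric['criteria']}\n{levels}\n\n"
        "Write brief feedback, then end with '[RESULT] <score>'."
    )

def parse_score(judgement: str) -> Optional[int]:
    """Extract the integer score from the evaluator's '[RESULT] <score>' suffix."""
    match = re.search(r"\[RESULT\]\s*([1-5])", judgement)
    return int(match.group(1)) if match else None
```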

To test the robustness of the evaluation mechanism, the paper also includes human grading to measure the correlation between evaluator-LM scores and human scores. Notably, the results indicate a significant correlation, validating the efficacy of evaluator LMs in mimicking human judgment on nuanced tasks.
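
One way to quantify this agreement is with standard correlation measures over matched human and evaluator scores. The snippet below is a minimal sketch using SciPy; the scores are placeholders, not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder scores for the same set of responses; not the paper's actual data.
human_scores     = [5, 3, 4, 2, 4, 1, 5, 3]
evaluator_scores = [5, 3, 5, 2, 4, 2, 4, 3]

pearson_r, pearson_p = pearsonr(human_scores, evaluator_scores)
spearman_rho, spearman_p = spearmanr(human_scores, evaluator_scores)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```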

Key Findings and Analysis

Scaling Performance: The analysis reveals that performance improves predictably with scale in both pre-trained and post-trained models, albeit with notable differences. For pre-trained models ('base LMs'), an increase in size leads to predictable gains across capabilities. In contrast, for 'chat LMs,' which have undergone instruction tuning or RLHF (Reinforcement Learning from Human Feedback), scaling is beneficial but not the sole determinant of performance. This underscores the importance of the post-training process in achieving optimal model performance.

Capability-Specific Insights: The study provides granular insights by evaluating specific capabilities. For instance, the gaps in reasoning and tool-usage capabilities between pre-trained, post-trained, and proprietary LMs remain significant even with scaling. In contrast, capabilities like instruction following improve more substantially, closing the gap more effectively as model size increases.

Reliability of Open-Source Evaluator LMs: An important contribution of the paper is the validation of open-source evaluator LMs. Through continual feedback training and self-consistency decoding, the research demonstrates that these models can achieve performance levels comparable to proprietary LMs, ensuring accessible and transparent evaluations. This opens pathways for developing robust internal evaluation pipelines without the recurrent costs associated with API calls to proprietary LMs.
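
Self-consistency decoding in this setting can be read as sampling several independent judgments at nonzero temperature and aggregating the parsed scores. The sketch below assumes a generic sample_judgement callable (any function that returns one evaluator completion) and the '[RESULT] <score>' convention from the earlier sketch; averaging is one reasonable aggregation choice, with majority vote being a common alternative.

```python
import re
from statistics import mean
from typing import Callable, Optional

def self_consistent_score(
    sample_judgement: Callable[[str], str],  # assumed helper: one sampled completion per call
    prompt: str,
    n_samples: int = 5,
) -> Optional[float]:
    """Sample several evaluator judgments at temperature > 0 and average their scores.

    Samples whose output lacks a parsable '[RESULT] <score>' suffix are skipped.
    """
    scores = []
    for _ in range(n_samples):
        judgement = sample_judgement(prompt)
        match = re.search(r"\[RESULT\]\s*([1-5])", judgement)
        if match:
            scores.append(int(match.group(1)))
    return mean(scores) if scores else None
```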

Implications and Future Directions

Theoretical and Practical Implications: The BiGGen Bench's approach and findings highlight crucial implications for both the development and deployment of LMs. The fine-grained, instance-specific evaluation ensures detailed performance feedback, critical for identifying specific areas requiring improvement. Furthermore, the demonstrated reliability of open-source evaluator LMs suggests a feasible pathway toward scalable, cost-effective, and transparent AI evaluation infrastructures.

Future Directions: Considering the advances and insights presented, future research could further investigate the specific data and training regimes beneficial for enhancing capabilities that remain challenging, such as reasoning and tool usage. Additionally, exploring more sophisticated methods for evaluator LMs specialized in certain capabilities could significantly enhance the reliability and precision of AI evaluations.

In conclusion, the BiGGen Bench sets a high standard for LM evaluation by providing a nuanced, rigorous, and systematic approach. This benchmark's ability to closely mimic human judgment paves the way for more precise and actionable insights in the development of next-generation language models, ensuring their capabilities are thoroughly and accurately assessed.
