
Abstract

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.

Figure: Each instance features a fine-grained evaluation criterion for precise performance assessment.

Overview

  • The BiGGen Bench introduces a comprehensive and systematic benchmarking methodology for fine-grained evaluation of 103 frontier language models (LMs), focusing on nine core capabilities across 77 tasks and 765 instances.

  • The benchmark employs five evaluator LMs, including proprietary models like GPT-4, and examines whether open-source evaluator LMs, improved through continual feedback training and self-consistency decoding, can serve as reliable alternatives.

  • Key findings emphasize predictable performance improvements with model scaling, significant gaps in capabilities like reasoning and tool usage, and the promising potential of open-source evaluator LMs for cost-effective and transparent evaluations.

The BiGGen Bench: An In-depth Analysis of a Principled Benchmark for Fine-grained Evaluation of Language Models

The rapid advancement of language models (LMs) has significantly expanded their potential applications, resulting in a corresponding need for robust evaluation mechanisms that can rigorously assess their varied capabilities. The BiGGen Bench emerges as a notable contribution in this area, providing a detailed and systematic benchmarking methodology that evaluates LMs across a broad spectrum of functionalities with fine-grained criteria. This paper introduces the BiGGen Bench and presents analysis and results from the evaluation of 103 frontier language models.

Core Tenets of BiGGen Bench

Fine-Grained, Instance-Specific Evaluation: BiGGen Bench is distinctive in its use of instance-specific evaluation criteria, moving beyond traditional broad measures of "helpfulness" and "harmlessness" often found in existing benchmarks. Instead, it focuses on nine core capabilities: instruction following, grounding, planning, reasoning, refinement, safety, theory of mind, tool usage, and multilingualism, evaluated across 77 tasks and 765 instances. This approach aligns more closely with the nuanced discernment typical of human evaluators, allowing for a deeper and more precise assessment of each LM's strengths and weaknesses.
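
To make the idea of instance-specific criteria concrete, a single benchmark instance might look roughly like the sketch below. The field names, the task name, and the rubric text are illustrative assumptions rather than the dataset's actual schema.

```python
# A minimal sketch of what one instance with an instance-specific rubric might
# look like. Field names and contents are illustrative assumptions, not the
# dataset's actual schema.
example_instance = {
    "capability": "reasoning",       # one of the nine core capabilities
    "task": "hypothesis_proposal",   # hypothetical task name within that capability
    "input": (
        "A bakery's sales drop every third week of the month. "
        "Propose two plausible hypotheses that could explain this pattern."
    ),
    "reference_answer": (
        "Possible hypotheses: (1) payday cycles reduce discretionary spending in "
        "the third week; (2) a competitor runs monthly promotions in that week."
    ),
    # An instance-specific 5-point rubric, rather than a generic criterion
    # such as overall 'helpfulness'.
    "score_rubric": {
        "criteria": (
            "Does the response propose distinct, plausible hypotheses grounded "
            "in the described sales pattern?"
        ),
        "score1": "Proposes no hypothesis or an irrelevant one.",
        "score2": "Proposes a single vague hypothesis with no link to the pattern.",
        "score3": "Proposes two hypotheses, but one is implausible or generic.",
        "score4": "Proposes two plausible hypotheses with minor gaps in justification.",
        "score5": "Proposes two distinct, plausible, well-justified hypotheses.",
    },
}
```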

Evaluator Language Models: The benchmark leverages evaluator LMs to conduct the assessments, addressing the challenge of subjective evaluation by standardizing the scoring criteria. Five evaluator LMs, including proprietary models like GPT-4, are used to ensure reliable and consistent grading. Additionally, the work explores the reliability of open-source evaluator LMs enhanced through techniques such as continual feedback training and self-consistency decoding.

Evaluation Protocol and Construction

The BiGGen Bench employs a comprehensive evaluation protocol in which each LM response is assessed against a detailed rubric defined for each instance. These rubrics and tasks are designed through a human-in-the-loop process, ensuring that each instance is relevant to, and appropriately difficult for, the capability being tested. A rigorous cross-validation step further safeguards the quality of the instances.
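
As a rough illustration of this rubric-based protocol, the sketch below assembles a grading prompt from an instance (using the dictionary layout sketched earlier) and parses a 1-5 score from the evaluator's reply. The prompt wording, the '[RESULT] <score>' output convention, and the helper names are assumptions for illustration, not the benchmark's verbatim evaluation prompt.

```python
import re
from typing import Optional

def build_judge_prompt(instance: dict, response: str) -> str:
    """Assemble a grading prompt from an instance's rubric and a model response.

    The template and the '[RESULT] <score>' convention are illustrative
    assumptions, not the benchmark's exact prompt.
    """
    rubric = instance["score_rubric"]
    levels = "\n".join(f"Score {i}: {rubric[f'score{i}']}" for i in range(1, 6))
    return (
        "You are a strict evaluator. Grade the response on a 1-5 scale.\n\n"
        f"### Instruction:\n{instance['input']}\n\n"
        f"### Response to evaluate:\n{response}\n\n"
        f"### Reference answer:\n{instance['reference_answer']}\n\n"
        f"### Scoring criteria:\n{rubric['criteria']}\n{levels}\n\n"
        "Write brief feedback, then end with '[RESULT] <score>'."
    )

def parse_score(judgement: str) -> Optional[int]:
    """Extract the integer score from the evaluator's '[RESULT] <score>' suffix."""
    match = re.search(r"\[RESULT\]\s*([1-5])", judgement)
    return int(match.group(1)) if match else None
```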

To test the robustness of the evaluation mechanism, the paper also includes human grading to measure the correlation between evaluator-LM scores and human scores. Notably, the results indicate a significant correlation, validating the efficacy of evaluator LMs in mimicking human judgment on nuanced tasks.
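
One way to quantify this agreement is with standard correlation measures over matched human and evaluator scores. The snippet below is a minimal sketch using SciPy; the scores are placeholders, not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder scores for the same set of responses; not the paper's actual data.
human_scores     = [5, 3, 4, 2, 4, 1, 5, 3]
evaluator_scores = [5, 3, 5, 2, 4, 2, 4, 3]

pearson_r, pearson_p = pearsonr(human_scores, evaluator_scores)
spearman_rho, spearman_p = spearmanr(human_scores, evaluator_scores)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```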

Key Findings and Analysis

Scaling Performance: The analysis reveals that performance improves predictably with scale in both pre-trained and post-trained models, albeit with notable differences. For pre-trained models ('base LMs'), an increase in size leads to predictable gains across capabilities. In contrast, for 'chat LMs,' which have undergone instruction tuning or RLHF (Reinforcement Learning from Human Feedback), scaling is beneficial but not the sole determinant of performance. This underscores the importance of the post-training process in achieving optimal model performance.

Capability-Specific Insights: The study provides granular insights by evaluating specific capabilities. For instance, the gaps in reasoning and tool-usage capabilities between pre-trained, post-trained, and proprietary LMs remain significant even with scaling. In contrast, capabilities like instruction following improve more substantially, closing the gap more effectively as model size increases.

Reliability of Open-Source Evaluator LMs: An important contribution of the paper is the validation of open-source evaluator LMs. Through continual feedback training and self-consistency decoding, the research demonstrates that these models can achieve performance levels comparable to proprietary LMs, ensuring accessible and transparent evaluations. This opens pathways for developing robust internal evaluation pipelines without the recurrent costs associated with API calls to proprietary LMs.
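
Self-consistency decoding in this setting can be read as sampling several independent judgments at nonzero temperature and aggregating the parsed scores. The sketch below assumes a generic sample_judgement callable (any function that returns one evaluator completion) and the '[RESULT] <score>' convention from the earlier sketch; averaging is one reasonable aggregation choice, with majority vote being a common alternative.

```python
import re
from statistics import mean
from typing import Callable, Optional

def self_consistent_score(
    sample_judgement: Callable[[str], str],  # assumed helper: one sampled completion per call
    prompt: str,
    n_samples: int = 5,
) -> Optional[float]:
    """Sample several evaluator judgments at temperature > 0 and average their scores.

    Samples whose output lacks a parsable '[RESULT] <score>' suffix are skipped.
    """
    scores = []
    for _ in range(n_samples):
        judgement = sample_judgement(prompt)
        match = re.search(r"\[RESULT\]\s*([1-5])", judgement)
        if match:
            scores.append(int(match.group(1)))
    return mean(scores) if scores else None
```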

Implications and Future Directions

Theoretical and Practical Implications: The BiGGen Bench's approach and findings highlight crucial implications for both the development and deployment of LMs. The fine-grained, instance-specific evaluation ensures detailed performance feedback, critical for identifying specific areas requiring improvement. Furthermore, the demonstrated reliability of open-source evaluator LMs suggests a feasible pathway toward scalable, cost-effective, and transparent AI evaluation infrastructures.

Future Directions: Considering the advances and insights presented, future research could further investigate the specific data and training regimes beneficial for enhancing capabilities that remain challenging, such as reasoning and tool usage. Additionally, exploring more sophisticated methods for evaluator LMs specialized in certain capabilities could significantly enhance the reliability and precision of AI evaluations.

In conclusion, the BiGGen Bench sets a high standard for LM evaluation by providing a nuanced, rigorous, and systematic approach. This benchmark's ability to closely mimic human judgment paves the way for more precise and actionable insights in the development of next-generation language models, ensuring their capabilities are thoroughly and accurately assessed.
