Quantifying Variance in Evaluation Benchmarks

(arXiv 2406.10229)
Published Jun 14, 2024 in cs.LG and cs.AI

Abstract

Evaluation benchmarks are the cornerstone of measuring capabilities of LLMs, as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale (~7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.

Figure: Increased variance in benchmark mean estimates when using IRT or IRT++ during pretraining.

Overview

  • The paper examines the inherent variance in evaluation benchmarks for LLMs and its impact on model performance assessment.

  • It presents an empirical study across 13 NLP benchmarks and over 280 models, offering a granular analysis of expected variance and the limitations of traditional variance reduction methods.

  • The study suggests adopting continuous metrics and alternative formats like MMLU-Cloze for more reliable model evaluations and calls for future research to develop LLM-specific techniques for variance reduction.

Quantifying Variance in Evaluation Benchmarks: An Overview

The paper "Quantifying Variance in Evaluation Benchmarks" addresses a foundational challenge in evaluating LLMs: the inherent variance in benchmark scores. Traditionally, evaluation benchmarks have been used to assess the capabilities of LLMs, guiding both research and development by providing comparative performance metrics. However, the paper highlights a significant oversight: the lack of quantification of variance within these benchmarks, which can obscure meaningful differences in model performance.

Empirical Analysis of Variance

The authors conduct an extensive empirical study across 13 widely recognized NLP benchmarks and over 280 models, including both intermediate checkpoints and fully-trained public models. The key contributions of this study include:

  • Comprehensive Reference Guide: The paper provides a granular analysis of expected variance magnitudes across benchmarks under different conditions, notably capturing seed variance and its implications.
  • Recommendations for Variance Reduction: For specific cases, such as smaller (~7B) models evaluated on choice tasks like MMLU, the paper proposes techniques that reduce variance, for instance reframing the task as a cloze-style completion, though these methods are not universally applicable.
  • Caution Against Ineffective Methods: The utility of traditional methods from human testing literature, such as item analysis and item response theory (IRT), is critically evaluated, revealing their limitations in effectively reducing variance for LLM evaluations.

Seed Variance and Confidence Intervals

The findings show substantial seed variance across the benchmarks studied, reported alongside bootstrapped 95% confidence intervals. For some benchmarks, such as AGIEval and MMLU, observed performance remains near chance even after extensive training, reflecting high variance and a low signal-to-noise ratio. The study suggests that continuous performance metrics often yield better predictive stability and a higher signal-to-noise ratio than the discrete metrics traditionally used in benchmarks.
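
As a rough illustration of how such figures can be produced, the sketch below computes a mean, seed standard deviation, and bootstrapped 95% confidence interval from scores that differ only in random seed. The function name, input layout, and numbers are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def seed_variance_summary(scores, n_boot=10_000, rng_seed=0):
    """Summarise variability of a benchmark score across training seeds.

    `scores` is a 1-D array of final benchmark scores from runs that differ
    only in their random seed (a hypothetical input layout).
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(rng_seed)

    # Seed variance: spread attributable to initialisation alone.
    mean = scores.mean()
    seed_std = scores.std(ddof=1)

    # Bootstrap 95% confidence interval on the mean score.
    boot_means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    return {"mean": mean, "seed_std": seed_std, "ci95": (ci_low, ci_high)}

# Example: accuracy from five runs differing only in seed (illustrative numbers).
print(seed_variance_summary([0.262, 0.270, 0.255, 0.268, 0.261]))
```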

Monotonicity and Performance Development

The paper also introduces and measures monotonicity, a metric capturing how consistently benchmark scores improve over the course of training. Continuous metrics tend to exhibit higher monotonicity, which makes them more reliable indicators of model improvement over time. This reinforces the suggestion that continuous metrics can make evaluations more dependable, particularly during iterative model development.
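
One natural formalisation, assumed here for illustration (the paper's exact definition may differ), is the Spearman rank correlation between training progress and benchmark score; the numbers below are made up to show how a continuous metric can track training more monotonically than a noisy discrete accuracy.

```python
from scipy.stats import spearmanr

def monotonicity(steps, scores):
    """Spearman rank correlation between training progress and benchmark score.

    Values near 1 mean the score improves almost monotonically across
    checkpoints; values near 0 indicate noisy, non-monotonic development.
    """
    rho, _ = spearmanr(steps, scores)
    return rho

steps = [1000, 2000, 3000, 4000, 5000]
# Discrete accuracy is often noisier than a continuous score such as the
# average log-likelihood assigned to correct answers (illustrative numbers).
print(monotonicity(steps, [0.25, 0.27, 0.26, 0.29, 0.28]))   # discrete accuracy
print(monotonicity(steps, [-2.1, -1.9, -1.8, -1.6, -1.5]))   # continuous metric
```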

Case Study: MMLU vs. MMLU-Cloze

A notable case study within the paper contrasts the traditional MMLU format with a cloze formulation (MMLU-Cloze). The analysis reveals that MMLU-Cloze, though non-standard, behaves better during early training stages, with lower variance and higher monotonicity. Interestingly, while larger models ultimately score higher on standard MMLU, their performance on MMLU-Cloze is highly correlated with their standard MMLU performance, making MMLU-Cloze a potentially more stable alternative for early evaluations.
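
The sketch below illustrates the difference between the two formats, assuming a hypothetical `logprob(prompt, continuation)` helper (any LM scoring API could stand in; this is not tied to the paper's evaluation harness): the standard format scores the option letters after listing all choices, while the cloze variant scores each answer text as a direct continuation of the question, with a simple length normalisation.

```python
# Hypothetical helper assumed below: logprob(prompt, continuation) returns the
# model's total log-likelihood of `continuation` given `prompt`.

LETTERS = ["A", "B", "C", "D"]

def score_mmlu_choice(logprob, question, options):
    """Standard MMLU: present all options and score the letter labels."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)
    ) + "\nAnswer:"
    scores = [logprob(prompt, f" {letter}") for letter in LETTERS]
    return scores.index(max(scores))

def score_mmlu_cloze(logprob, question, options):
    """Cloze variant: score each answer text as a completion of the question,
    length-normalising so longer answers are not unfairly penalised."""
    scores = [
        logprob(question + "\nAnswer:", f" {opt}") / max(len(opt.split()), 1)
        for opt in options
    ]
    return scores.index(max(scores))
```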

Item Analysis and Its Limitations

The study applies item analysis to understand the properties of individual benchmark items (questions), examining metrics such as item difficulty and discrimination. Surprisingly, item discrimination calculated on weaker models does not correlate well with discrimination calculated on stronger models, which limits the utility of item analysis for making informed evaluations. Pruning low-discrimination items can slightly reduce the standard error but also shifts the mean performance estimate, suggesting limited practical benefit.
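
For reference, the sketch below computes these item statistics under common textbook definitions, with models standing in for human test-takers: difficulty as the fraction of models answering an item correctly, and discrimination as the correlation between item correctness and the rest-of-test score. The matrix layout and estimators are assumptions; the paper's exact procedure may differ.

```python
import numpy as np

def item_statistics(responses):
    """Classical item analysis on a binary response matrix.

    `responses[m, i]` is 1 if model m answered item i correctly, else 0
    (a hypothetical layout; models play the role of test-takers).
    """
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)          # each model's total score

    difficulty = responses.mean(axis=0)     # fraction of models correct per item

    # Discrimination: correlation between getting item i right and the score
    # on the remaining items (corrected item-total correlation).
    n_items = responses.shape[1]
    discrimination = np.zeros(n_items)
    for i in range(n_items):
        item = responses[:, i]
        rest = totals - item
        if item.std() > 0 and rest.std() > 0:
            discrimination[i] = np.corrcoef(item, rest)[0, 1]
    return difficulty, discrimination

# Example with three models and four items (illustrative 0/1 outcomes).
diff, disc = item_statistics([[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]])
```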

Reassessing Item Response Theory (IRT)

Extending the evaluation, the paper applies IRT-based methods to create smaller, more efficient benchmarks. While promising for estimating mean performance, these methods introduce increased seed variance and reduced monotonicity, complicating model comparisons. Thus, while IRT-based methods can offer efficiency gains, they are less reliable for nuanced performance comparisons during model development.
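
As background, the sketch below shows the two-parameter logistic (2PL) item response function and the item information it implies, which is the usual basis for selecting a small but informative subset of items. This is a generic illustration with made-up parameters, not the paper's specific IRT or IRT++ procedure.

```python
import numpy as np

def irt_2pl_prob(theta, a, b):
    """2PL IRT model: probability that a model with ability `theta` answers an
    item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of an item at ability `theta` under the 2PL model;
    IRT-based benchmark reduction tends to keep the most informative items."""
    p = irt_2pl_prob(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Example: rank three items by informativeness for a mid-ability model (theta = 0).
a = np.array([0.5, 1.2, 2.0])    # item discriminations (made-up values)
b = np.array([-1.0, 0.0, 1.5])   # item difficulties (made-up values)
print(np.argsort(-item_information(0.0, a, b)))
```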

Implications and Future Directions

The implications of this research are multifaceted:

  • Practical Considerations: Model practitioners are encouraged to consider continuous metrics and alternative formats like MMLU-Cloze, particularly for early-stage evaluations.
  • Caution in Statistical Methods: Traditional item analysis and IRT, though useful in human testing, are less effective for LLM evaluations due to increased variance and ranking inconsistencies.
  • Further Exploration: Future research could explore the underlying causes of these limitations and seek LLM-specific evaluation techniques to reliably reduce variance and improve monotonicity.

Overall, this paper provides a detailed empirical foundation for understanding evaluation benchmark variance in LLMs, offering practical guidelines for more reliable model comparisons and highlighting areas for further methodological innovation.
