
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

(2402.01781)
Published Feb 1, 2024 in cs.CL , cs.AI , and cs.LG

Abstract

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly) mistake. Under existing leaderboards, the relative performance of LLMs is highly sensitive to (often minute) details. We show that for popular multiple choice question benchmarks (e.g. MMLU) minor perturbations to the benchmark, such as changing the order of choices or the method of answer selection, result in changes in rankings up to 8 positions. We explain this phenomenon by conducting systematic experiments over three broad categories of benchmark perturbations and identifying the sources of this behavior. Our analysis results in several best-practice recommendations, including the advantage of a hybrid scoring method for answer selection. Our study highlights the dangers of relying on simple benchmark evaluations and charts the path for more robust evaluation schemes on the existing benchmarks.

Figure: Minor perturbations lead to significant ranking changes on MMLU.

Overview

  • The paper assesses the sensitivity of Large Language Model (LLM) leaderboards to minor changes in multiple-choice question (MCQ) benchmarks, revealing significant instability in rankings.

  • Experimental results show that slight modifications, such as the order of choices or scoring methods, can lead to dramatic ranking shifts among models, indicating fragility in current leaderboard systems.

  • The study underscores the need for more robust and consistent benchmarks to ensure reliable evaluation and highlights the importance of addressing biases and scoring methods in LLM assessments.

When Benchmarks are Targets: Analyzing the Sensitivity of LLM Leaderboards

The study by Norah Alzahrani et al. offers a thorough examination of the sensitivity of Large Language Model (LLM) leaderboards to minor perturbations in multiple-choice question (MCQ) benchmarks. These leaderboards, which practitioners frequently use to guide model selection, are shown to be unstable under small changes, with significant implications for both theoretical understanding and practical application.

The authors conducted a systematic series of experiments on well-known MCQ benchmarks, most notably the Massive Multitask Language Understanding (MMLU) benchmark. Their experiments reveal that even trivial modifications, such as altering the order of choices or changing the scoring method, can cause dramatic shifts in model rankings, moving models by as many as 8 positions and demonstrating the fragility and potential unreliability of leaderboards built on these benchmarks.
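To make the choice-order perturbation concrete, the following minimal sketch (hypothetical helper functions, not the authors' released code) shows how an MMLU-style item can be reordered while keeping the gold label aligned; models are then re-scored on the perturbed copies and the leaderboard is recomputed:

```python
import random

def shuffle_choices(question, choices, answer_idx, seed=0):
    """Return a copy of an MCQ item with its answer choices reordered and
    the gold label remapped, so only the presentation order changes."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    return question, new_choices, order.index(answer_idx)

def format_prompt(question, choices, symbols=("A", "B", "C", "D")):
    """Render the item in the usual MMLU-style symbol format."""
    lines = [question] + [f"{s}. {c}" for s, c in zip(symbols, choices)]
    return "\n".join(lines) + "\nAnswer:"

# The same item before and after the perturbation.
q = "Which planet is known as the Red Planet?"
choices = ["Venus", "Mars", "Jupiter", "Saturn"]
print(format_prompt(q, choices))
pq, pc, pa = shuffle_choices(q, choices, answer_idx=1, seed=42)
print(format_prompt(pq, pc), f"(gold is now choice {pa})")
```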

Key findings from the paper are as follows:

  1. Leaderboard Instability: The authors identified substantial variability in model rankings under slight perturbations. Randomizing the order of answer choices, for instance, led to major rank changes, with the Yi-6b model dropping from 3rd to 9th place.
  2. Sources of Bias: The study explores sources of bias such as token and positional biases. LLMs showed a clear preference for specific choice positions and symbols, and these biases affected model performance unpredictably.
  3. Scoring Method Impact: The choice of scoring method (symbol, hybrid, or cloze scoring) was another significant source of instability. Symbol scoring, despite being the most common, led to high selection biases, while cloze scoring reduced bias but produced the poorest performance scores. Hybrid scoring provided a more balanced evaluation (see the sketch after this list).
  4. In-Context Knowledge Sensitivity: In-context manipulations like presenting correct or incorrect answers as part of the context led to models either "cheating" by copying the answers or performing poorly when misleading information was included.
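The three scoring methods in finding 3 can be sketched against a generic `logprob(prompt, continuation)` interface, a stand-in for whatever likelihood API a given model exposes (this interface and the exact hybrid formulation are assumptions; the paper's implementation details may differ):

```python
import numpy as np

SYMBOLS = ["A", "B", "C", "D"]

def build_prompt(question, choices):
    """MMLU-style prompt that lists the labelled choices."""
    lines = [question] + [f"{s}. {c}" for s, c in zip(SYMBOLS, choices)]
    return "\n".join(lines) + "\nAnswer:"

def symbol_pick(logprob, question, choices):
    """Symbol scoring: rank the choice labels ("A".."D") by their
    log-likelihood after the full prompt that shows all choices."""
    prompt = build_prompt(question, choices)
    return int(np.argmax([logprob(prompt, " " + s) for s in SYMBOLS[:len(choices)]]))

def cloze_pick(logprob, question, choices):
    """Cloze scoring: choices are not shown; rank the answer texts by
    length-normalized log-likelihood after the bare question."""
    prompt = question + "\nAnswer:"
    return int(np.argmax([logprob(prompt, " " + c) / max(len(c.split()), 1)
                          for c in choices]))

def hybrid_pick(logprob, question, choices):
    """Hybrid scoring (one common formulation, assumed here): show the
    labelled choices as in symbol scoring, but score the answer text as in
    cloze scoring, reducing dependence on the label tokens themselves."""
    prompt = build_prompt(question, choices)
    return int(np.argmax([logprob(prompt, " " + c) / max(len(c.split()), 1)
                          for c in choices]))
```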

The implications of these findings are manifold. Firstly, there is a need for more robust and consistent benchmark designs to ensure the reliable evaluation of LLMs. The current reliance on unstable benchmarks can lead to inefficiencies and misallocation of resources, especially given the high costs associated with training and deploying LLMs. Secondly, understanding and mitigating biases in LLMs is critical for the validity of their evaluations: a bias toward specific formats or symbols can produce a skewed assessment of a model's true capabilities.
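One simple way to surface such a selection bias (a minimal sketch, not the paper's exact metric) is to tabulate how often a model picks each label position on a benchmark whose gold answers are evenly spread across positions:

```python
from collections import Counter

def selection_bias(predictions, num_choices=4):
    """Fraction of items on which each label position is chosen; an
    unbiased model should pick each position roughly 1/num_choices of
    the time when gold answers are uniformly distributed."""
    counts = Counter(predictions)
    return {pos: counts.get(pos, 0) / len(predictions) for pos in range(num_choices)}

# Hypothetical predictions (indices of the chosen option) on 10 items.
print(selection_bias([0, 0, 1, 0, 0, 2, 0, 0, 3, 0]))
# {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1} -> a strong bias toward option "A"
```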

Future research should focus on developing benchmarks that are resistant to minor perturbations. This might include combining multiple scoring methods, standardizing choice formats, and randomizing answer order in a consistent, documented way. Additionally, greater transparency about the training datasets used for LLM development is needed to address concerns about potential overfitting to benchmark formats.
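Resistance to a perturbation can also be checked directly by comparing the leaderboards a benchmark produces before and after that perturbation. The sketch below (hypothetical model names and accuracies) uses Kendall's tau from SciPy for that comparison; values near 1.0 indicate a stable ranking:

```python
from itertools import combinations
from scipy.stats import kendalltau

def ranking(scores):
    """Models ordered best-to-worst by accuracy."""
    return [m for m, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

def stability_report(scores_by_variant):
    """Pairwise Kendall's tau between the rankings produced by each
    benchmark variant."""
    models = list(next(iter(scores_by_variant.values())))
    for (name_a, sa), (name_b, sb) in combinations(scores_by_variant.items(), 2):
        ra, rb = ranking(sa), ranking(sb)
        tau, _ = kendalltau([ra.index(m) for m in models],
                            [rb.index(m) for m in models])
        print(f"{name_a} vs {name_b}: tau = {tau:.2f}")

# Hypothetical accuracies for three models under two benchmark variants.
stability_report({
    "original":         {"model_a": 0.71, "model_b": 0.68, "model_c": 0.65},
    "shuffled_choices": {"model_a": 0.69, "model_b": 0.70, "model_c": 0.64},
})
```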

In conclusion, this paper underscores the necessity for the AI research community to rethink how LLMs are evaluated and compared. Ensuring the stability and fairness of benchmarks will not only lead to better model selection and resource utilization but also drive the development of more robust AI systems capable of performing reliably in varied and realistic settings. The paper by Alzahrani et al. provides a critical step in this direction, offering valuable insights and practical recommendations for improving the evaluation methodologies in AI research.
