Changing Answer Order Can Decrease MMLU Accuracy

arXiv:2406.19470
Published Jun 27, 2024 in cs.CL

Abstract

As LLMs have grown in prevalence, particular benchmarks have become essential for the evaluation of these models and for understanding model capabilities. Most commonly, we use test accuracy averaged across multiple subtasks in order to rank models on leaderboards, to determine which model is best for our purposes. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.

Figure: Accuracy scores on random MMLU categories under the proposed shuffling, with the question count per category indicated.

Overview

  • Gupta et al. explore the robustness of LLMs by analyzing their performance on the Massive Multitask Language Understanding (MMLU) dataset when the contents of the answer choices are shuffled.

  • The study reveals significant drops in accuracy across various LLMs, suggesting that these models are not as robust to changes in answer ordering as previously assumed.

  • The authors propose a robustness metric to better evaluate model stability under these shuffles and suggest revisions to benchmarking practices to account for this fragility.

Analysis of "Changing Answer Order Can Decrease MMLU Accuracy"

In their paper, Gupta et al. present an empirical investigation of the robustness of LLMs when the order of answer choices is changed within a widely used evaluation framework, the Massive Multitask Language Understanding (MMLU) dataset. The study's findings carry significant implications for evaluating and interpreting the performance of LLMs, suggesting potential revisions to how benchmarks are constructed and used to rank models on leaderboards.

Introduction and Motivation

Benchmarking LLMs typically involves measuring test accuracy across a suite of tasks, aggregating performance metrics to determine model rankings. Despite the common use of these benchmarks, the underlying fragility of accuracy measurements remains a concern. Prior research has identified various robustness issues, such as sensitivity to paraphrases and minor perturbations in input data. This paper extends that analysis by investigating whether shuffling the content of answer choices affects model accuracy, particularly focusing on the MMLU dataset.

Methodology

The MMLU dataset is a popular benchmark comprising 57 tasks designed to assess an LLM's world knowledge and problem-solving abilities. Each task involves multiple-choice questions with four answer options. The authors modified the evaluation process by shuffling the contents of the answer choices while keeping the labels themselves (A, B, C, D) in their original order. This isolates the perturbation: any observed change in model accuracy stems purely from where the answer contents appear, not from changes to the question or prompt format.
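A minimal sketch of this shuffling step, assuming a simple A-D prompt format; the function name, argument layout, and prompt template here are illustrative and not the authors' exact evaluation harness:

```python
import random

def shuffle_answer_contents(question, choices, answer_idx, rng):
    """Shuffle the contents of the answer choices while keeping the labels
    A-D in their usual order; return the new prompt and the index of the
    correct choice after shuffling."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # where the gold answer landed

    labels = ["A", "B", "C", "D"]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{label}. {text}" for label, text in zip(labels, shuffled))
        + "\nAnswer:"
    )
    return prompt, new_answer_idx

# Example: the gold answer "4" keeps its content but may move to a new label.
prompt, gold = shuffle_answer_contents(
    "What is 2 + 2?", ["3", "4", "5", "6"], answer_idx=1, rng=random.Random(0)
)
```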

The core metric proposed by the authors measures a model's robustness by how often it answers the same questions correctly across different shuffles. Averaging performance over multiple shuffles also reduces the influence of answers that are correct only by random chance.
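A sketch of how such a score could be computed from per-question correctness records, assuming one boolean per question per shuffle; the paper's exact definition may differ in detail:

```python
def shuffle_robustness(correct_by_shuffle):
    """correct_by_shuffle[s][q] is True if question q was answered
    correctly under shuffle s.

    Returns (accuracy averaged over shuffles,
             fraction of questions answered correctly under every shuffle)."""
    num_shuffles = len(correct_by_shuffle)
    num_questions = len(correct_by_shuffle[0])

    mean_accuracy = sum(
        sum(per_shuffle) / num_questions for per_shuffle in correct_by_shuffle
    ) / num_shuffles

    always_correct = sum(
        all(correct_by_shuffle[s][q] for s in range(num_shuffles))
        for q in range(num_questions)
    ) / num_questions

    return mean_accuracy, always_correct
```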

Experimental Results

The study evaluated ten state-of-the-art LLMs, including both base and instruction-tuned models. These models range in size from 7 billion to 70 billion parameters and are well represented on various leaderboards. Notable models tested include Llama-3, Yi-34B, and Falcon-40B.

All models exhibited a decrease in accuracy when the answer contents were shuffled. For instance, the Llama-3-70B-instruct model showed a 6.2% drop, while the Falcon-40B-instruct model experienced a more substantial decline of 27.2%. This degradation indicates that models are not robust to variations in answer ordering, a perturbation previously presumed to be trivial for advanced LLMs.
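For reference, a relative drop of this kind can be computed as below; it is an assumption here that the reported figures are relative drops rather than absolute percentage-point differences:

```python
def relative_drop(original_acc, shuffled_acc):
    """Relative accuracy drop in percent, e.g. 0.80 -> 0.75 gives 6.25%.
    (Assumes the paper reports relative drops, not percentage-point gaps.)"""
    return 100.0 * (original_acc - shuffled_acc) / original_acc
```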

An analysis of the performance drop across MMLU subcategories indicated that problem-solving tasks were particularly affected. For example, on high school mathematics questions, the Gemma-7B-instruct model's accuracy decreased by 42.9%. This suggests that LLMs may be benefiting from regularities in the original ordering of answer choices, a cue that should be irrelevant when assessing true model understanding and capability.
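A sketch of such a per-category breakdown, assuming per-question records tagged with their MMLU subject (the field names and record layout are illustrative):

```python
from collections import defaultdict

def per_category_relative_drop(records):
    """records: iterable of (category, correct_original, correct_shuffled)
    tuples, one per question, where the last two entries are booleans.
    Returns the relative accuracy drop (in percent) for each category."""
    counts = defaultdict(lambda: [0, 0, 0])  # [questions, correct_orig, correct_shuf]
    for category, correct_orig, correct_shuf in records:
        counts[category][0] += 1
        counts[category][1] += int(correct_orig)
        counts[category][2] += int(correct_shuf)

    drops = {}
    for category, (n, c_orig, c_shuf) in counts.items():
        orig_acc, shuf_acc = c_orig / n, c_shuf / n
        drops[category] = (
            100.0 * (orig_acc - shuf_acc) / orig_acc if orig_acc else 0.0
        )
    return drops
```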

Theoretical and Practical Implications

These findings challenge the common practice of using averaged test accuracy as a reliable indicator of model performance. The paper calls for a reconsideration of how models are ranked and suggests incorporating robustness metrics that account for variations in answer ordering. Such adjustments could provide a more nuanced and accurate picture of an LLM’s capabilities.

Practically, the implications extend to the design of more robust evaluation frameworks that mitigate overfitting to specific benchmark constructs. The authors highlight that future leaderboards should include metrics that reflect a model's stability across multiple test iterations, thereby encouraging the development of truly robust LLMs.
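One illustrative way a leaderboard entry could carry both an averaged accuracy and a stability score; this is purely a sketch of the suggestion, not an implementation from the paper:

```python
from dataclasses import dataclass

@dataclass
class LeaderboardEntry:
    model: str
    mean_accuracy: float     # accuracy averaged over answer shuffles
    stable_accuracy: float   # fraction of questions correct under every shuffle

def rank(entries):
    """Rank primarily by stability across shuffles, breaking ties by mean
    accuracy. This ordering criterion is illustrative, not prescribed by
    the paper."""
    return sorted(
        entries,
        key=lambda e: (e.stable_accuracy, e.mean_accuracy),
        reverse=True,
    )
```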

Conclusions and Future Directions

The consistent decline in accuracy under answer shuffling highlights a critical vulnerability in current LLM evaluation practices. To address this, the paper introduces a metric that measures test-retest stability and advocates for its integration into standard evaluation procedures. Future work could explore further shuffling variations and additional perturbation types to assess model robustness more comprehensively. The paper also underscores the importance of continuously refining benchmarking methodologies so they evolve in tandem with advances in LLM capabilities.

By shedding light on the brittle nature of model performance under answer order variations, Gupta et al.'s work encourages the AI research community to adopt more rigorous and multi-faceted evaluation strategies, ultimately leading to the development of more reliable and generalizable AI systems.
