
A Careful Examination of Large Language Model Performance on Grade School Arithmetic

(arXiv:2405.00332)
Published May 1, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

LLMs have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini/GPT/Claude), show minimal signs of overfitting. Further analysis suggests a positive relationship (Spearman's r² = 0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.

Figure: Models exceeding 70% accuracy on GSM8k, highlighting overfitting in some (e.g., Mistral, Phi) and stability in others.

Overview

  • The GSM1k benchmark was created to test the real reasoning capabilities of LLMs using a set of grade-school level mathematical problems, aiming to avoid data contamination from previous benchmarks like GSM8k.

  • Several LLMs, including prominent ones like GPT-4, Gemini, and Claude, were evaluated on GSM1k. A key finding was a notable drop in accuracy for some models compared to their performance on GSM8k, indicating possible overfitting.

  • The study suggests a need for more rigorous benchmarking in AI to truly assess model capabilities and support the development of LLMs that can generalize beyond memorized data, proposing controlled releases of benchmarks to prevent contamination.

Understanding Overfitting in LLMs through the GSM1k Benchmark

Introduction

The GSM1k benchmark was created to address significant concerns in the AI research community about the genuine capabilities of LLMs. These concerns primarily revolve around whether the impressive performance of these models on existing mathematical benchmarks stems from actual reasoning or merely from replicating answers seen in contaminated training data. Let's dive deeper into what was uncovered.

Unveiling GSM1k: A New Benchmark

GSM1k serves as a fresh set of grade-school level mathematical problems, designed to parallel the well-known GSM8k benchmark in style and complexity, yet written entirely by human annotators without the use of any LLMs to avoid data contamination. It comprises 1,250 carefully crafted problems meant to evaluate the real reasoning capabilities of various LLMs.

  • Model Evaluation: The study tested both open-source and proprietary models on GSM1k, including well-known ones like GPT-4, Gemini, and Claude, among others.
  • Key Findings: There were notable drops in accuracy, up to 13%, particularly in the Phi and Mistral model families, indicating possible overfitting relative to GSM8k performance (a sketch of how such a gap can be measured follows this list).
  • Contrasting Performances: While some model families exhibited signs of overfitting, leading-edge models (e.g., Gemini/GPT/Claude) showed minimal to none, suggesting more robust generalization capabilities.
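
The core comparison in the study reduces to measuring each model's accuracy on both benchmarks and inspecting the difference. Below is a minimal sketch of that bookkeeping; the file paths, the one-JSON-record-per-line format, the `is_correct` field, and the model names are illustrative assumptions, not the paper's actual evaluation harness.

```python
import json

def accuracy(path: str) -> float:
    """Fraction of questions answered correctly in a results file.

    Assumes one JSON record per line with a boolean `is_correct` field
    (an illustrative format, not the paper's actual harness).
    """
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return sum(r["is_correct"] for r in records) / len(records)

def overfit_gap(model: str) -> float:
    """GSM8k accuracy minus GSM1k accuracy; a large positive gap is the
    paper's signal of possible overfitting or contamination."""
    gsm8k = accuracy(f"results/{model}_gsm8k.jsonl")
    gsm1k = accuracy(f"results/{model}_gsm1k.jsonl")
    return gsm8k - gsm1k

if __name__ == "__main__":
    for model in ["mistral-7b", "phi-2", "gpt-4"]:  # hypothetical names
        print(model, f"gap = {overfit_gap(model):+.1%}")
```

A model that genuinely reasons should show a gap near zero; a model that has partially memorized GSM8k should lose accuracy on the unseen GSM1k problems.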

The Indicator of Overfitting

The research pinpointed a significant indicator of overfitting through a statistical analysis technique:

  • Probability Relationship: There is a positive correlation (Spearman's r² = 0.32) between a model's likelihood of regenerating examples from GSM8k and its performance gap between GSM8k and GSM1k. This suggests that many models have partially memorized GSM8k, a sign of overfitting; a rough sketch of this kind of analysis is shown below.
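
Concretely, this kind of analysis pairs, for each model, a measure of how likely the model is to generate GSM8k examples (a proxy for memorization) with that model's GSM8k–GSM1k accuracy gap, then rank-correlates the two. The sketch below uses SciPy's `spearmanr`; the per-model values are placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Placeholder per-model values (not the paper's data):
# mean log-likelihood each model assigns to GSM8k examples, and that
# model's accuracy gap (GSM8k accuracy minus GSM1k accuracy).
log_likelihoods = [-1.9, -1.6, -1.3, -1.0, -0.7]
accuracy_gaps   = [0.02, 0.01, 0.05, 0.09, 0.07]

rho, p_value = spearmanr(log_likelihoods, accuracy_gaps)
print(f"Spearman rho = {rho:.2f}, rho^2 = {rho**2:.2f}, p = {p_value:.3f}")
# The paper reports r^2 = 0.32 across its evaluated models; a positive
# rank correlation points the same way: models that find GSM8k more
# "familiar" tend to lose more accuracy on GSM1k.
```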

Implications and Future Predictions

  • Practical Implications: Recognizing overfit models and understanding their limitations can lead to more honest assessments of LLM capabilities and guide more efficient use of resources in model training and development.
  • Theoretical Advances: These findings push the understanding of "generalization" within AI, prompting more rigorous testing environments that better measure true model capability beyond memorized data.
  • Future of AI Benchmarks: The authors are keeping GSM1k private for now to avoid further contamination. The future could see similar controlled releases guiding the development of more challenging, contamination-free benchmarks.

Model Capabilities Beyond Overfitting

Interestingly, the study also highlights an essential nuance in the debate on AI's reasoning abilities:

  • Generalization Skills: Despite reductions in performance metrics due to potential overfitting, models like Phi and Mistral still perform well on GSM1k, suggesting they retain a meaningful capability to generalize beyond memorized data.

In conclusion, while the research from GSM1k brings to light the serious issue of overfitting in evaluating LLMs, it also presents a complex but hopeful view of the potential for these models to develop genuine reasoning abilities. The trajectory for future research and development, spurred by findings like these, likely holds both enhanced model training methods and more robust benchmarking tools that can accurately measure and foster true AI capabilities.
