A Careful Examination of Large Language Model Performance on Grade School Arithmetic (2405.00332v4)
Abstract: LLMs have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, in which data closely resembling benchmark questions leaks into the training data, rather than genuine reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of solution steps, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 8%, with several families of models showing evidence of systematic overfitting across almost all model sizes. Further analysis suggests a positive relationship (Spearman's r² = 0.36) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that some models may have partially memorized GSM8k. Nevertheless, many models, especially those on the frontier, show minimal signs of overfitting, and all models broadly demonstrate generalization to novel math problems guaranteed not to be in their training data.