An Open Source Data Contamination Report for Large Language Models (2310.17589v3)
Abstract: Data contamination in model evaluation has become increasingly prevalent with the growing popularity of LLMs. It allows models to "cheat" via memorisation instead of displaying true capabilities. Therefore, contamination analysis has become a crucial step in validating model evaluation results. However, existing contamination analysis is usually conducted internally by LLM developers and often lacks transparency and completeness. This paper presents an extensive data contamination report for over 15 popular LLMs across six multiple-choice QA benchmarks. We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models. Our experiments reveal contamination levels ranging from 1% to 45% across benchmarks, with the degree of contamination increasing rapidly over time. Performance analysis of LLMs indicates that data contamination does not necessarily inflate model metrics: while significant accuracy boosts of up to 14% and 7% are observed on the contaminated C-Eval and HellaSwag benchmarks respectively, only a minimal increase is noted on contaminated MMLU. We also find that larger models appear to gain greater advantages than smaller models on contaminated test sets.
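The open-source pipeline mentioned in the abstract checks benchmark items against training data. A common way to implement such a check is word-level n-gram overlap between each benchmark sample and the training corpus; the sketch below illustrates that idea under stated assumptions. The function names, the whitespace tokenisation, and the 13-gram window are illustrative choices for this sketch, not a description of the paper's actual method.

```python
# Hypothetical sketch of n-gram-overlap contamination detection.
# This is one common approach, not necessarily the paper's exact pipeline.

from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_ratio(benchmark_samples: Iterable[str],
                        training_corpus: Iterable[str],
                        n: int = 13) -> float:
    """Fraction of benchmark samples that share at least one
    n-gram with any document in the training corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)

    samples = list(benchmark_samples)
    contaminated = sum(1 for s in samples if ngrams(s, n) & corpus_ngrams)
    return contaminated / len(samples) if samples else 0.0

# Toy usage: the first benchmark sample appears verbatim in training data.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
bench = ["the quick brown fox jumps over the lazy dog near the river bank today",
         "a completely unrelated question about physics"]
print(contamination_ratio(bench, train))  # 0.5
```

A longer n-gram window trades recall for precision: a 13-token span rarely matches by chance, so any hit is strong evidence of verbatim overlap.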
- Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- Qwen technical report.
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
- PIQA: Reasoning about physical commonsense in natural language.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646.
- Evaluating large language models trained on code.
- QuAC: Question answering in context. arXiv preprint arXiv:1808.07036.
- PaLM: Scaling language modeling with pathways.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- Training verifiers to solve math word problems.
- Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus. arXiv preprint arXiv:2104.08758.
- Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322.
- Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks.
- Mistral 7B.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension.
- Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
- Benjamin Marie. 2023. The decontaminated evaluation of GPT-4. Accessed: 2023-07-28.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing.
- OpenAI. 2023. GPT-4 technical report.
- OpenCompass. 2023. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass.
- Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
- NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
- WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM, 64(9):99–106.
- SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.
- CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- Baichuan 2: Open large-scale language models.
- Yi. 2023. A series of large language models trained from scratch by developers at 01-ai. https://github.com/01-ai/Yi.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364.