
Investigating Data Contamination for Pre-training Language Models (2401.06059v1)

Published 11 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus -- a phenomenon known as data contamination -- in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on LLM capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.


Summary

  • The paper systematically investigates how text and ground-truth contamination in GPT-2 pre-training affects downstream performance, revealing task-specific variations.
  • It critiques prevailing n-gram-based contamination definitions, demonstrating their limitations in capturing semantic overlaps and susceptibility to false positives and negatives.
  • Empirical results highlight a U-shaped performance curve with repeated contamination, underscoring risks of overfitting and the need for robust auditing protocols.

Data Contamination in Pre-training LLMs: Empirical Analysis and Methodological Critique

Introduction

This paper presents a systematic investigation into the effects of data contamination in the pre-training corpora of LLMs, with a particular focus on the leakage of evaluation datasets. The authors address a critical gap in the literature: while contamination has been acknowledged as a potential confounder in LLM evaluation, most prior work has relied on post-hoc, n-gram-based definitions and evaluation-level analyses, rather than direct, pre-training-level interventions. By pre-training GPT-2 models from scratch under controlled contamination scenarios, the paper provides quantitative insights into how different forms and degrees of contamination affect downstream task performance, and critically evaluates the adequacy of prevailing contamination definitions.

Contamination Definitions and Experimental Design

The paper distinguishes between two primary forms of contamination, both illustrated in the sketch after the list:

  • Text Contamination: Inclusion of evaluation dataset input texts in the pre-training corpus.
  • Ground-truth Contamination: Inclusion of input texts, prompts, and corresponding ground-truth labels/answers.
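To make the two settings concrete, here is a minimal sketch (not the authors' released code) of how a single evaluation example could be turned into a text-contaminated versus a ground-truth-contaminated pre-training document; the field names and prompt template are assumptions for illustration.

```python
# Illustrative construction of the two contamination variants for one
# evaluation example; field names and the prompt template are assumed.

def make_text_contamination(example: dict) -> str:
    """Text contamination: only the raw input text of the evaluation
    sample is injected into the pre-training corpus."""
    return example["input_text"]

def make_ground_truth_contamination(example: dict, prompt: str) -> str:
    """Ground-truth contamination: the input text, the evaluation prompt,
    and the gold answer are injected together, formatted roughly as they
    would appear at evaluation time."""
    return f"{example['input_text']}\n{prompt} {example['label']}"

# Hypothetical SST-2-style record:
example = {"input_text": "A gripping, beautifully shot film.", "label": "positive"}
prompt = "Question: Is the sentiment of the review positive or negative? Answer:"

text_doc = make_text_contamination(example)
gt_doc = make_ground_truth_contamination(example, prompt)
# Either kind of document would then be mixed into (and optionally repeated
# within) the corpus before pre-training GPT-2 from scratch.
```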

The authors critique the dominant n-gram-based contamination definitions (e.g., PaLM, Llama 2), highlighting their limitations in capturing semantic duplication and their susceptibility to both false positives (due to common n-grams) and false negatives (due to paraphrasing). The experimental setup involves pre-training GPT-2-small (124M) and GPT-2-large (774M) models on subsets of the Pile, with controlled injection of contaminated data, and evaluation on canonical NLP benchmarks (SST-2, MMLU, CNN/Daily Mail, SQuAD).
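For reference, the kind of document-level n-gram check these reports rely on can be sketched as below; the choice of n = 8 and the 70% threshold echo PaLM-style settings but are illustrative here, and real pipelines operate on tokenized, deduplicated corpora rather than whitespace-split strings.

```python
# Minimal sketch of a document-level n-gram contamination check of the kind
# critiqued in the paper; n and the threshold are illustrative values.

def ngrams(tokens: list[str], n: int) -> set[tuple]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text: str, train_texts: list[str],
                    n: int = 8, threshold: float = 0.7) -> bool:
    """Flag an evaluation sample if a large fraction of its n-grams also
    occur somewhere in the training corpus."""
    eval_ngrams = ngrams(eval_text.split(), n)   # whitespace tokens for brevity
    if not eval_ngrams:
        return False
    train_ngrams = set()
    for doc in train_texts:
        train_ngrams |= ngrams(doc.split(), n)
    overlap = len(eval_ngrams & train_ngrams) / len(eval_ngrams)
    return overlap >= threshold
```

Because the check requires exact n-gram matches, a paraphrased copy of an evaluation sample passes undetected (a false negative), while boilerplate phrasing shared across many documents can trigger spurious matches (false positives), which is precisely the weakness the paper documents.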

Quantitative Effects of Contamination

Empirical results demonstrate that both text and ground-truth contamination lead to measurable improvements in downstream task performance relative to models trained on uncontaminated corpora. However, the magnitude and nature of these improvements are task-dependent:

  • Ground-truth contamination generally yields larger performance gains than text contamination, especially for tasks requiring instruction following or answer generation (e.g., CNN/Daily Mail summarization, SQuAD QA).
  • For classification tasks with short inputs (e.g., SST-2), ground-truth contamination does not consistently outperform text contamination, likely due to the limited impact of prompt/label information and increased sensitivity to prompt formatting.

The paper further reveals that repeated contamination can induce a U-shaped performance curve: initial repetitions of contaminated data improve performance, but excessive repetition (e.g., >10x) leads to degradation, sometimes below baseline. This non-monotonicity suggests overfitting and memorization effects, and underscores the importance of quantifying not just the presence but the frequency of contamination in large-scale corpora.
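A hedged sketch of how such a repetition sweep could be organized is shown below; only the corpus construction is concrete, and the training and evaluation calls are placeholders for the paper's GPT-2 pipeline, which is not specified at this level of detail.

```python
# Corpus construction for the repeated-contamination sweep; the names
# pretrain_gpt2_from_scratch and evaluate_downstream are placeholders.

def build_pretraining_mix(clean_docs: list[str],
                          contaminated_docs: list[str],
                          repetitions: int) -> list[str]:
    """Corpus in which each contaminated document appears `repetitions`
    times; repetitions=0 gives the uncontaminated baseline."""
    return list(clean_docs) + list(contaminated_docs) * repetitions

# Sweep outline:
#   for k in [0, 1, 5, 10, 20, 50]:
#       corpus = build_pretraining_mix(clean_docs, contaminated_docs, k)
#       model = pretrain_gpt2_from_scratch(corpus)        # placeholder
#       scores[k] = evaluate_downstream(model, eval_set)  # placeholder
# A U-shaped outcome would show gains for small k that reverse, sometimes
# dropping below the k=0 baseline, once repetition becomes excessive.
```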

Critique of N-gram-based Contamination Definitions

The authors systematically filter pre-training corpora using n-gram and Llama 2 definitions, removing up to 30% of tokens labeled as contaminated. The results show no consistent relationship between the proportion of removed tokens and model performance, indicating that these definitions are insufficient for reliably identifying impactful contamination. The strictness of PaLM's definition, for example, results in negligible document removal, while Llama 2's token-based approach is highly sensitive to parameter choices and context.
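To illustrate the token-level filtering being critiqued, the sketch below flags training documents by the fraction of their tokens covered by long spans that also occur in the evaluation data; the minimum span length of 10 tokens and the removal threshold are assumptions chosen for illustration rather than the exact Llama 2 parameters.

```python
# Illustrative token-level contamination filter; span length and removal
# threshold are assumed values, not the exact Llama 2 settings.

def contaminated_token_fraction(doc_tokens: list[str],
                                eval_ngrams: set, n: int = 10) -> float:
    """Fraction of a training document's tokens covered by length-n spans
    that also appear in the evaluation set."""
    flagged = [False] * len(doc_tokens)
    for i in range(len(doc_tokens) - n + 1):
        if tuple(doc_tokens[i:i + n]) in eval_ngrams:
            for j in range(i, i + n):
                flagged[j] = True
    return sum(flagged) / max(len(doc_tokens), 1)

def filter_corpus(corpus: list[str], eval_texts: list[str],
                  n: int = 10, max_fraction: float = 0.2) -> list[str]:
    """Drop training documents whose contaminated-token fraction exceeds
    the threshold."""
    eval_ngrams = set()
    for text in eval_texts:
        toks = text.split()
        eval_ngrams |= {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return [doc for doc in corpus
            if contaminated_token_fraction(doc.split(), eval_ngrams, n) <= max_fraction]
```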

The paper also replicates evaluation-level contamination analyses (categorizing evaluation samples as "clean", "dirty", etc.), finding that performance differences across categories are marginal and do not support claims of model robustness to contamination. This calls into question the validity of such categorical analyses for assessing contamination effects.

Scaling and Generalization

Experiments with GPT-2-large confirm that contamination effects persist at larger model and corpus scales, with ground-truth contamination still yielding significant performance improvements. However, the absolute performance remains below that of public GPT-2 checkpoints, highlighting the continued importance of pre-training data scale and diversity.

The authors also note that contamination effects are dataset-dependent; for example, in AG News, contamination does not yield substantial gains, possibly due to pre-existing overlap between the evaluation set and the subsampled corpus.

Implications and Future Directions

The findings have several important implications:

  • Current contamination detection methods are inadequate: N-gram-based and token-overlap definitions fail to capture semantic duplication and are easily evaded by paraphrasing.
  • Ground-truth contamination is underappreciated: Inclusion of prompts and answers in pre-training data can substantially inflate evaluation metrics, especially for generative tasks.
  • Repeated contamination and memorization require further study: The observed U-shaped performance curve suggests complex interactions between memorization, generalization, and overfitting in LLMs.
  • Evaluation-level analyses are insufficient: Robustness claims based on categorical evaluation splits are not supported by pre-training-level interventions.

Future work should focus on developing more semantically informed contamination definitions (e.g., embedding-based, syntax-aware, or paraphrase detection), and on scalable, corpus-level contamination auditing tools. There is also a need for standardized protocols for contamination reporting in LLM research, and for further investigation into the relationship between contamination, memorization, and privacy risks.
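As one example of the semantically informed direction suggested here, a simple embedding-similarity check could catch paraphrased duplicates that n-gram matching misses; this is an illustrative sketch rather than a method from the paper, and the sentence-embedding model and similarity threshold are assumptions.

```python
# Illustrative embedding-based contamination check; the embedding model and
# threshold are assumptions, and exhaustive pairwise comparison would need
# approximate nearest-neighbor search at corpus scale.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

def semantic_contamination(eval_texts: list[str], train_texts: list[str],
                           threshold: float = 0.9) -> list[int]:
    """Return indices of evaluation samples whose most similar training
    document exceeds a cosine-similarity threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    eval_emb = model.encode(eval_texts, normalize_embeddings=True)
    train_emb = model.encode(train_texts, normalize_embeddings=True)
    sims = np.asarray(eval_emb) @ np.asarray(train_emb).T  # cosine similarities
    return [i for i, row in enumerate(sims) if row.max() >= threshold]
```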

Conclusion

This paper provides a rigorous empirical analysis of data contamination in LLM pre-training, demonstrating that both text and ground-truth contamination can artificially inflate downstream performance, and that prevailing contamination definitions are insufficient for reliable detection. The results underscore the necessity for more precise contamination auditing and reporting, and for the development of robust, semantically informed detection methodologies. These findings have direct implications for the evaluation, deployment, and trustworthiness of LLMs in both research and production settings.
