
Investigating Data Contamination for Pre-training Language Models (2401.06059v1)

Published 11 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus -- a phenomenon known as data contamination -- in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on LLM capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.


Summary

  • The paper systematically investigates how text and ground-truth contamination in GPT-2 pre-training affects downstream performance, revealing task-specific variations.
  • It critiques prevailing n-gram-based contamination definitions, demonstrating their limitations in capturing semantic overlaps and susceptibility to false positives and negatives.
  • Empirical results highlight a U-shaped performance curve with repeated contamination, underscoring risks of overfitting and the need for robust auditing protocols.

Data Contamination in Pre-training LLMs: Empirical Analysis and Methodological Critique

Introduction

This paper presents a systematic investigation into the effects of data contamination in the pre-training corpora of LLMs, with a particular focus on the leakage of evaluation datasets. The authors address a critical gap in the literature: while contamination has been acknowledged as a potential confounder in LLM evaluation, most prior work has relied on post-hoc, n-gram-based definitions and evaluation-level analyses, rather than direct, pre-training-level interventions. By pre-training GPT-2 models from scratch under controlled contamination scenarios, the paper provides quantitative insights into how different forms and degrees of contamination affect downstream task performance, and critically evaluates the adequacy of prevailing contamination definitions.

Contamination Definitions and Experimental Design

The paper distinguishes between two primary forms of contamination, both illustrated in the sketch after the list:

  • Text Contamination: Inclusion of evaluation dataset input texts in the pre-training corpus.
  • Ground-truth Contamination: Inclusion of input texts, prompts, and corresponding ground-truth labels/answers.
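To make the two settings concrete, here is a minimal sketch (not the authors' released code) of how a single evaluation example could be turned into a text-contaminated versus a ground-truth-contaminated pre-training document; the field names and prompt template are assumptions for illustration.

```python
# Illustrative construction of the two contamination variants for one
# evaluation example; field names and the prompt template are assumed.

def make_text_contamination(example: dict) -> str:
    """Text contamination: only the raw input text of the evaluation
    sample is injected into the pre-training corpus."""
    return example["input_text"]

def make_ground_truth_contamination(example: dict, prompt: str) -> str:
    """Ground-truth contamination: the input text, the evaluation prompt,
    and the gold answer are injected together, formatted roughly as they
    would appear at evaluation time."""
    return f"{example['input_text']}\n{prompt} {example['label']}"

# Hypothetical SST-2-style record:
example = {"input_text": "A gripping, beautifully shot film.", "label": "positive"}
prompt = "Question: Is the sentiment of the review positive or negative? Answer:"

text_doc = make_text_contamination(example)
gt_doc = make_ground_truth_contamination(example, prompt)
# Either kind of document would then be mixed into (and optionally repeated
# within) the corpus before pre-training GPT-2 from scratch.
```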

The authors critique the dominant n-gram-based contamination definitions (e.g., PaLM, Llama 2), highlighting their limitations in capturing semantic duplication and their susceptibility to both false positives (due to common n-grams) and false negatives (due to paraphrasing). The experimental setup involves pre-training GPT-2-small (124M) and GPT-2-large (774M) models on subsets of the Pile, with controlled injection of contaminated data, and evaluation on canonical NLP benchmarks (SST-2, MMLU, CNN/Daily Mail, SQuAD).
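For reference, the kind of document-level n-gram check these reports rely on can be sketched as below; the choice of n = 8 and the 70% threshold echo PaLM-style settings but are illustrative here, and real pipelines operate on tokenized, deduplicated corpora rather than whitespace-split strings.

```python
# Minimal sketch of a document-level n-gram contamination check of the kind
# critiqued in the paper; n and the threshold are illustrative values.

def ngrams(tokens: list[str], n: int) -> set[tuple]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text: str, train_texts: list[str],
                    n: int = 8, threshold: float = 0.7) -> bool:
    """Flag an evaluation sample if a large fraction of its n-grams also
    occur somewhere in the training corpus."""
    eval_ngrams = ngrams(eval_text.split(), n)   # whitespace tokens for brevity
    if not eval_ngrams:
        return False
    train_ngrams = set()
    for doc in train_texts:
        train_ngrams |= ngrams(doc.split(), n)
    overlap = len(eval_ngrams & train_ngrams) / len(eval_ngrams)
    return overlap >= threshold
```

Because the check requires exact n-gram matches, a paraphrased copy of an evaluation sample passes undetected (a false negative), while boilerplate phrasing shared across many documents can trigger spurious matches (false positives), which is precisely the weakness the paper documents.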

Quantitative Effects of Contamination

Empirical results demonstrate that both text and ground-truth contamination lead to measurable improvements in downstream task performance relative to models trained on uncontaminated corpora. However, the magnitude and nature of these improvements are task-dependent:

  • Ground-truth contamination generally yields larger performance gains than text contamination, especially for tasks requiring instruction following or answer generation (e.g., CNN/Daily Mail summarization, SQuAD QA).
  • For classification tasks with short inputs (e.g., SST-2), ground-truth contamination does not consistently outperform text contamination, likely due to the limited impact of prompt/label information and increased sensitivity to prompt formatting.

The paper further reveals that repeated contamination can induce a U-shaped performance curve: initial repetitions of contaminated data improve performance, but excessive repetition (e.g., >10x) leads to degradation, sometimes below baseline. This non-monotonicity suggests overfitting and memorization effects, and underscores the importance of quantifying not just the presence but the frequency of contamination in large-scale corpora.
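A hedged sketch of how such a repetition sweep could be organized is shown below; only the corpus construction is concrete, and the training and evaluation calls are placeholders for the paper's GPT-2 pipeline, which is not specified at this level of detail.

```python
# Corpus construction for the repeated-contamination sweep; the names
# pretrain_gpt2_from_scratch and evaluate_downstream are placeholders.

def build_pretraining_mix(clean_docs: list[str],
                          contaminated_docs: list[str],
                          repetitions: int) -> list[str]:
    """Corpus in which each contaminated document appears `repetitions`
    times; repetitions=0 gives the uncontaminated baseline."""
    return list(clean_docs) + list(contaminated_docs) * repetitions

# Sweep outline:
#   for k in [0, 1, 5, 10, 20, 50]:
#       corpus = build_pretraining_mix(clean_docs, contaminated_docs, k)
#       model = pretrain_gpt2_from_scratch(corpus)        # placeholder
#       scores[k] = evaluate_downstream(model, eval_set)  # placeholder
# A U-shaped outcome would show gains for small k that reverse, sometimes
# dropping below the k=0 baseline, once repetition becomes excessive.
```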

Critique of N-gram-based Contamination Definitions

The authors systematically filter pre-training corpora using n-gram and Llama 2 definitions, removing up to 30% of tokens labeled as contaminated. The results show no consistent relationship between the proportion of removed tokens and model performance, indicating that these definitions are insufficient for reliably identifying impactful contamination. The strictness of PaLM's definition, for example, results in negligible document removal, while Llama 2's token-based approach is highly sensitive to parameter choices and context.
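To illustrate the token-level filtering being critiqued, the sketch below flags training documents by the fraction of their tokens covered by long spans that also occur in the evaluation data; the minimum span length of 10 tokens and the removal threshold are assumptions chosen for illustration rather than the exact Llama 2 parameters.

```python
# Illustrative token-level contamination filter; span length and removal
# threshold are assumed values, not the exact Llama 2 settings.

def contaminated_token_fraction(doc_tokens: list[str],
                                eval_ngrams: set, n: int = 10) -> float:
    """Fraction of a training document's tokens covered by length-n spans
    that also appear in the evaluation set."""
    flagged = [False] * len(doc_tokens)
    for i in range(len(doc_tokens) - n + 1):
        if tuple(doc_tokens[i:i + n]) in eval_ngrams:
            for j in range(i, i + n):
                flagged[j] = True
    return sum(flagged) / max(len(doc_tokens), 1)

def filter_corpus(corpus: list[str], eval_texts: list[str],
                  n: int = 10, max_fraction: float = 0.2) -> list[str]:
    """Drop training documents whose contaminated-token fraction exceeds
    the threshold."""
    eval_ngrams = set()
    for text in eval_texts:
        toks = text.split()
        eval_ngrams |= {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return [doc for doc in corpus
            if contaminated_token_fraction(doc.split(), eval_ngrams, n) <= max_fraction]
```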

The paper also replicates evaluation-level contamination analyses (categorizing evaluation samples as "clean", "dirty", etc.), finding that performance differences across categories are marginal and do not support claims of model robustness to contamination. This calls into question the validity of such categorical analyses for assessing contamination effects.

Scaling and Generalization

Experiments with GPT-2-large confirm that contamination effects persist at larger model and corpus scales, with ground-truth contamination still yielding significant performance improvements. However, the absolute performance remains below that of public GPT-2 checkpoints, highlighting the continued importance of pre-training data scale and diversity.

The authors also note that contamination effects are dataset-dependent; for example, in AG News, contamination does not yield substantial gains, possibly due to pre-existing overlap between the evaluation set and the subsampled corpus.

Implications and Future Directions

The findings have several important implications:

  • Current contamination detection methods are inadequate: N-gram-based and token-overlap definitions fail to capture semantic duplication and are easily evaded by paraphrasing.
  • Ground-truth contamination is underappreciated: Inclusion of prompts and answers in pre-training data can substantially inflate evaluation metrics, especially for generative tasks.
  • Repeated contamination and memorization require further study: The observed U-shaped performance curve suggests complex interactions between memorization, generalization, and overfitting in LLMs.
  • Evaluation-level analyses are insufficient: Robustness claims based on categorical evaluation splits are not supported by pre-training-level interventions.

Future work should focus on developing more semantically informed contamination definitions (e.g., embedding-based, syntax-aware, or paraphrase detection), and on scalable, corpus-level contamination auditing tools. There is also a need for standardized protocols for contamination reporting in LLM research, and for further investigation into the relationship between contamination, memorization, and privacy risks.
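As one example of the semantically informed direction suggested here, a simple embedding-similarity check could catch paraphrased duplicates that n-gram matching misses; this is an illustrative sketch rather than a method from the paper, and the sentence-embedding model and similarity threshold are assumptions.

```python
# Illustrative embedding-based contamination check; the embedding model and
# threshold are assumptions, and exhaustive pairwise comparison would need
# approximate nearest-neighbor search at corpus scale.

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

def semantic_contamination(eval_texts: list[str], train_texts: list[str],
                           threshold: float = 0.9) -> list[int]:
    """Return indices of evaluation samples whose most similar training
    document exceeds a cosine-similarity threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    eval_emb = model.encode(eval_texts, normalize_embeddings=True)
    train_emb = model.encode(train_texts, normalize_embeddings=True)
    sims = np.asarray(eval_emb) @ np.asarray(train_emb).T  # cosine similarities
    return [i for i, row in enumerate(sims) if row.max() >= threshold]
```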

Conclusion

This paper provides a rigorous empirical analysis of data contamination in LLM pre-training, demonstrating that both text and ground-truth contamination can artificially inflate downstream performance, and that prevailing contamination definitions are insufficient for reliable detection. The results underscore the necessity for more precise contamination auditing and reporting, and for the development of robust, semantically informed detection methodologies. These findings have direct implications for the evaluation, deployment, and trustworthiness of LLMs in both research and production settings.
