
Investigating Data Contamination for Pre-training Language Models

(2401.06059)
Published Jan 11, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus -- a phenomenon known as data contamination -- in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.

Figure: Comparison of model performance at varying contamination levels, showing the impact of ground-truth and text contamination across evaluation datasets.

Overview

  • The integrity of LLM evaluations may be compromised by the inadvertent inclusion of evaluation data in pre-training sets, a phenomenon known as data contamination.

  • The study differentiates between text contamination and ground-truth contamination and explores how each influences LLM performance.

  • The researchers pre-trained GPT-2 models from scratch with controlled contamination levels, evaluating the impact of different forms and repetition frequencies of contamination (a minimal sketch of such controlled injection follows this list).

  • Data contamination can significantly boost performance, but repeating the contamination produces a U-shaped trend, suggesting a complex relationship between performance and contamination frequency.

  • The paper advocates for improved contamination definitions and assessment methods to ensure accurate evaluations of LLMs.
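
As a rough illustration of the controlled setup summarized above, the sketch below mixes contaminated evaluation samples into a pre-training corpus at a chosen repetition count. The function and variable names are illustrative assumptions, not the authors' code.

```python
import random

def inject_contamination(corpus, contaminants, repetitions, seed=0):
    """Return a new corpus with each contaminant inserted `repetitions` times
    at random positions, so the contamination frequency is controlled explicitly.
    (Hypothetical sketch; not the paper's actual pipeline.)"""
    rng = random.Random(seed)
    mixed = list(corpus)
    for text in contaminants:
        for _ in range(repetitions):
            mixed.insert(rng.randrange(len(mixed) + 1), text)
    return mixed

# Usage: sweep the repetition count to study how downstream performance
# varies with contamination frequency.
for reps in (1, 4, 16, 64):
    training_data = inject_contamination(
        corpus=["web document ..."],        # placeholder pre-training documents
        contaminants=["evaluation sample ..."],  # placeholder contaminated samples
        repetitions=reps,
    )
```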

Introduction

LLMs have demonstrated remarkable performance across a wide range of tasks. While their sophisticated algorithms and expansive training sets account for much of their success, the issue of data contamination, specifically the inclusion of evaluation data in the training set, has started to raise concerns. Examining this issue is central to truly understanding the effectiveness and integrity of these models.

Contamination Implications

Recent observations suggest that pre-training data can contain portions of the very datasets used to evaluate these language models. The presence of such "contaminated" data can skew results, misleading us about a model's true capabilities. This study distinguishes between text contamination, where the evaluation texts themselves appear in the training set, and ground-truth contamination, where both the prompts and the expected outputs used in evaluation appear as well. Understanding these distinctions and their effects is critical to evaluating the true performance of LLMs.
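
To make the distinction concrete, here is a minimal sketch of how the two contamination variants could be formatted from a single evaluation sample. The templates and field names ("prompt", "text", "label") are assumptions for illustration and do not reproduce the paper's exact formatting.

```python
def make_text_contamination(example):
    """Text contamination: only the raw input text of the evaluation sample
    is inserted into the pre-training corpus."""
    return example["text"]

def make_ground_truth_contamination(example):
    """Ground-truth contamination: the evaluation prompt together with the
    desired output is inserted, exposing the full input-output mapping."""
    return f"{example['prompt']}{example['text']}\nAnswer: {example['label']}"

# Illustrative sentiment-classification sample (hypothetical).
sample = {
    "prompt": "Review: ",
    "text": "A thoroughly enjoyable film.",
    "label": "positive",
}
print(make_text_contamination(sample))
print(make_ground_truth_contamination(sample))
```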

Experimental Approach

The researchers pre-train GPT-2 models from scratch with meticulous control over data contamination levels. The study considers different forms and repetition frequencies of contamination to assess its impact comprehensively. It also scrutinizes the common n-gram-based contamination definitions found in existing LLM reports, revealing their potential inadequacy for contamination detection and model assessment.
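
For reference, the following is a minimal sketch of the kind of n-gram overlap test such reports describe (GPT-3's report, for example, flagged evaluation examples sharing a 13-gram with the training data). The whitespace tokenization and fixed n used here are simplifying assumptions, not the exact procedure from any specific report.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as token tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_text, train_ngrams, n=13):
    """Flag an evaluation example if any of its n-grams also appears
    in the pre-training corpus index."""
    return bool(ngrams(eval_text.split(), n) & train_ngrams)

# Usage: build the n-gram index once over the training corpus, then check
# each evaluation example against it.
train_corpus = ["..."]  # placeholder pre-training documents
index = set()
for doc in train_corpus:
    index |= ngrams(doc.split(), 13)

eval_examples = ["an evaluation example ..."]  # placeholder evaluation inputs
flagged = [ex for ex in eval_examples if is_contaminated(ex, index)]
```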

Findings and Recommendations

The findings are revealing. Data contamination, especially when it includes ground truths, can significantly enhance model performance, more so than text contamination alone; this paints a complex picture of the role of data purity in evaluation results. Remarkably, when contamination is repeated, a U-shaped performance trend emerges, with performance peaking and then declining as the number of repetitions grows. This suggests a nuanced relationship between performance and contamination frequency.

In conclusion, this research calls attention to the need for refined contamination definitions and robust assessment methodologies. The study's insights warn of the risks of data contamination and encourage the development of more stringent controls and greater transparency in LLM training pipelines.

Acknowledgements

Finally, the authors acknowledge the supporters of this research, from DARPA and the National Science Foundation to Google Inc. and the Alfred P. Sloan Foundation. Without such broad support, this investigation into the inner workings of LLMs and the effects of data contamination would not have been possible.
