Scaling Laws and Interpretability of Learning from Repeated Data (2205.10487v1)

Published 21 May 2022 in cs.LG and cs.AI

Abstract: Recent LLMs have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in LLMs could lead to disproportionately large harms to performance.

Citations (94)

Summary

  • The paper identifies a double-descent phenomenon when transformer models are trained with repeated data, linking repetition to significant performance degradation.
  • It demonstrates that even minimal repetition can consume a large fraction of a model's capacity, as seen with an 800M-parameter transformer whose test loss degrades to that of a model half its size.
  • The study reveals that repeated data disproportionately harms induction heads, undermining tasks that rely on effective generalization.

Introduction

Transformer models are increasingly indispensable in NLP. Yet, despite impressive performance on various tasks and benchmarks, their functioning and limitations are not entirely understood. This paper investigates the effects of training on datasets containing repeated data. Repeated data instances in training sets are common due to the imperfect nature of data deduplication processes. Previous literature reports contrasting effects of this repetition, with some studies indicating negligible impact and others identifying substantial performance degradation. The present research approaches the topic by studying the influence of repeated data on model performance through two distinct lenses: the macroscopic perspective of scaling laws and the microscopic view of mechanistic interpretability.

Scaling Laws

The experimental setup involved training transformer LLMs on a corpus of mostly unique data mixed with a small fraction of repeated instances. By systematically varying the repeated fraction, the number of repetitions, and the model size, the authors observe a double-descent phenomenon: within a certain range of repetition frequencies, test loss rises partway through training rather than decreasing monotonically. The damage is greatest when the repeated subset is memorized and that memorization consumes a substantial fraction of the model's capacity. For instance, an 800-million-parameter model trained with 0.1% of its data repeated 100 times reached a test loss roughly equal to that of a model with half as many parameters (400M).
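As an illustration of this kind of setup, the sketch below builds a token stream in which a small slice of an otherwise unique corpus is tiled many times. The function name, the NumPy-based representation, and the token-level shuffle are simplifications introduced here for clarity, not the authors' actual data pipeline.

```python
import numpy as np

def build_repeated_mix(unique_tokens, repeated_fraction=0.001, n_repeats=100, seed=0):
    """Construct a training stream where a small slice of the data is repeated
    many times while the rest stays unique (illustrative sketch only).

    With repeated_fraction=0.001 and n_repeats=100, the repeated subset ends up
    as roughly 10% of the final stream, matching the paper's 0.1% x 100 example.
    """
    rng = np.random.default_rng(seed)
    n = len(unique_tokens)
    n_repeat_src = int(n * repeated_fraction)

    # Slice off the subset that will be repeated; the remainder stays unique.
    repeat_src = unique_tokens[:n_repeat_src]
    unique_part = unique_tokens[n_repeat_src:]

    # Tile the repeated subset n_repeats times and mix it back in.
    repeated_part = np.tile(repeat_src, n_repeats)
    stream = np.concatenate([unique_part, repeated_part])
    rng.shuffle(stream)  # real pipelines shuffle at the document/sequence level, not per token
    return stream
```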

Interpretability

The paper also examines how repeated data affects individual model components associated with generalization. Induction heads, attention-head circuits that support in-context pattern completion by matching an earlier occurrence of the current token and copying the token that followed it, were disproportionately harmed by repeated data. Degradation was also observed on copying tasks, which probe generalization independent of the memorized content. This disproportionate impact underscores the shift within the model from generalization toward memorization.
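To make the copying probe concrete, here is a minimal sketch of one way such an ability can be measured: feed the model sequences whose second half repeats a random first half, then compare loss on the unpredictable first half with loss on the copyable second half. The `model` interface, shapes, and sequence construction are assumptions for illustration, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def copying_loss(model, vocab_size, seq_len=512, batch=8, device="cpu"):
    """Rough copying probe: the second half of each sequence repeats the first,
    so a model with working induction-style copying should show much lower loss
    on the second half. Assumes `model(tokens)` returns [batch, seq, vocab] logits.
    """
    half = seq_len // 2
    first = torch.randint(0, vocab_size, (batch, half), device=device)
    tokens = torch.cat([first, first], dim=1)  # second half copies the first

    with torch.no_grad():
        logits = model(tokens)  # [batch, seq_len, vocab_size]

    # Next-token cross-entropy at every position.
    losses = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
        reduction="none",
    ).reshape(batch, seq_len - 1)

    loss_random = losses[:, : half - 1].mean().item()  # random tokens, not predictable
    loss_copy = losses[:, half:].mean().item()         # predictable by copying the first half
    return loss_random, loss_copy
```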

Implications and Diagnostics

The research suggests that even a small fraction of repeated data can cause disproportionately severe performance degradation. This is particularly relevant for LLMs trained with upweighted high-quality sources such as Wikipedia, which risk being overfit. The mid-training rise in test loss characteristic of double descent offers a practical diagnostic for detecting when repeated data is harming training. Moreover, the mechanistic analysis connects the observed damage to specific internal computations, such as induction heads, offering insight into the model's learning dynamics.
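As a rough illustration of using the mid-training loss rise as a diagnostic, the helper below scans a series of held-out loss measurements and flags the first evaluation at which loss climbs noticeably above its running minimum. The function and its tolerance parameter are hypothetical choices for this sketch, not a diagnostic taken from the paper.

```python
def detect_midtraining_loss_rise(test_losses, tolerance=0.01):
    """Return the index of the first evaluation where held-out loss rises
    noticeably above its running minimum, or None if it never does.

    `test_losses` is a list of losses on unique held-out data recorded over
    the course of training; `tolerance` is the relative rise treated as a flag.
    """
    best_so_far = float("inf")
    for step, loss in enumerate(test_losses):
        if loss < best_so_far:
            best_so_far = loss
        elif loss > best_so_far * (1 + tolerance):
            return step  # loss has climbed above its running minimum
    return None
```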

Conclusion

In summary, the paper offers a hypothesis for why repetition in training datasets harms LLM performance. It argues that the harm manifests as double descent driven by overfitting to the repeated subset of the data, and it demonstrates a pronounced adverse effect on induction heads and copying, behaviors associated with generalization. These findings build a conceptual bridge between aggregate model performance and detailed mechanistic workings, providing a clearer understanding of how repeated data degrades LLMs.
