Scaling Laws and Interpretability of Learning from Repeated Data

(2205.10487)
Published May 21, 2022 in cs.LG and cs.AI

Abstract

Recent LLMs have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in LLMs could lead to disproportionately large harms to performance.
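As a quick sanity check on the headline example (a back-of-the-envelope sketch, not code or figures from the paper), the "0.1% of the data repeated 100 times" setup implies that roughly a tenth of the training tokens come from the repeated subset, leaving the other 90% unique:

```python
# Back-of-the-envelope check of the abstract's 0.1% / 100x example.
# Normalize the total training-token budget to 1.0.
total_tokens = 1.0

repeated_subset_size = 0.001    # the repeated subset is 0.1% of the token budget (unique size)
repeat_count = 100              # each of those tokens is seen 100 times during training

repeated_tokens = repeated_subset_size * repeat_count   # 0.10 -> 10% of training tokens
unique_tokens = total_tokens - repeated_tokens          # 0.90 -> the "other 90%" in the abstract

print(f"tokens from repeated subset: {repeated_tokens:.0%}")
print(f"unique tokens:               {unique_tokens:.0%}")
```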

Overview

  • This paper investigates the influence of repeated data on transformer model performance in NLP, uncovering a double-descent phenomenon in which repetition causes test loss to rise partway through training.

  • A controlled experimental setup, in which a small fraction of otherwise unique training data is repeated many times, showed that a predictable range of repetition frequencies can consume a large share of the model's capacity through memorization, pushing test loss up to that of a model with roughly half the parameters.

  • Induction heads, internal structures associated with in-context copying and generalization, were disproportionately damaged by repeated data, impairing the model's generalization capabilities.

  • Even a small fraction of repeated data can severely degrade performance, a risk that is especially relevant for LLMs trained on upweighted high-quality sources such as Wikipedia, which the model can overfit.

  • The study connects these raw performance harms to the underlying computations, giving a mechanistic account of how repeated data damages language model performance.

Introduction

Transformer models are increasingly indispensable in NLP. Yet, despite impressive performance on various tasks and benchmarks, their functioning and limitations are not entirely understood. This paper investigates the effects of training on datasets containing repeated data. Repetition is common in training sets, both because data deduplication is imperfect and because higher-quality sources are sometimes deliberately upweighted. Previous literature reports contrasting effects of this repetition, with some studies finding negligible impact and others identifying substantial performance degradation. The present research studies the influence of repeated data on model performance through two distinct lenses: the macroscopic perspective of scaling laws and the microscopic view of mechanistic interpretability.

Scaling Laws

The experimental setup involved training transformer language models on datasets composed mostly of unique data with a small fraction of repeated instances. Controlled variation of the repeated-data fraction, the model size, and the number of times the repeated tokens were seen revealed a double-descent phenomenon: within a certain range of repetition frequencies, test loss peaks rather than decreasing monotonically. The damage is greatest when the repeated subset is seen often enough that memorizing it consumes a substantial fraction of the model's capacity. For instance, an 800-million-parameter model with 0.1% of its data repeated 100 times reached a test loss nearly equivalent to that of a model with half as many parameters (400M).
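A minimal sketch of how such a mixture might be assembled is shown below. The helper name, document-level (rather than token-level) accounting, and shuffling scheme are assumptions for illustration, not the authors' code:

```python
import random

def build_training_stream(unique_docs, total_docs, repeated_share=0.10,
                          repeat_count=100, seed=0):
    """Hypothetical sketch of the repeated-data mixture (not the authors' code).

    total_docs:     size of the training stream, in documents.
    repeated_share: fraction of the stream drawn from the repeated subset
                    (0.10 in the 0.1%-repeated-100-times example).
    repeat_count:   how many times each repeated document appears.
    """
    rng = random.Random(seed)
    docs = list(unique_docs)
    rng.shuffle(docs)

    n_repeated_slots = int(total_docs * repeated_share)          # e.g. 10% of the stream
    n_repeated_docs = max(1, n_repeated_slots // repeat_count)   # e.g. 0.1% of the stream
    n_unique = total_docs - n_repeated_docs * repeat_count       # the rest stays unique

    repeated_subset = docs[:n_repeated_docs]
    unique_part = docs[n_repeated_docs:n_repeated_docs + n_unique]

    # Interleave the repeated copies with the unique documents.
    stream = unique_part + repeated_subset * repeat_count
    rng.shuffle(stream)
    return stream, repeated_subset
```

Holding the total budget fixed while sweeping `repeated_share`, `repeat_count`, and model size is what makes the degradation attributable to repetition rather than to a smaller effective dataset.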

Interpretability

The study also examines how repeated data affects individual model components known to be associated with generalization. Induction heads, attention heads that complete patterns in context by finding an earlier occurrence of the current token and copying the token that followed it, were disproportionately harmed by repeated data. Degradation was also observed in behaviors such as copying a given prefix, which reflect generalization ability independent of the specific content. This disproportionate impact underscores the significance of the shift from generalization to memorization within the model.
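One common way to quantify this copying behavior, loosely in the spirit of the paper's evaluations and the induction-head literature, is to compare a model's loss on the first and second occurrences of a random token sequence. Below is a minimal sketch; the `per_token_loss` callable is an assumed interface to the model, not part of the paper:

```python
import numpy as np

def copying_score(per_token_loss, vocab_size=50000, seq_len=64, n_trials=32, seed=0):
    """Sketch of an in-context copying probe (hypothetical, not the paper's exact eval).

    `per_token_loss(tokens)` is assumed to return the model's loss at each position
    of a 1-D token array. A model with working induction heads should have much
    lower loss on the second copy of a random sequence, because it can be predicted
    by looking back and copying.
    """
    rng = np.random.default_rng(seed)
    first_half_losses, second_half_losses = [], []
    for _ in range(n_trials):
        chunk = rng.integers(0, vocab_size, size=seq_len)
        tokens = np.concatenate([chunk, chunk])        # [random chunk][same chunk again]
        losses = per_token_loss(tokens)
        first_half_losses.append(np.mean(losses[:seq_len]))
        second_half_losses.append(np.mean(losses[seq_len:]))
    # Larger gap => stronger copying; training on repeated data should shrink it.
    return float(np.mean(first_half_losses) - np.mean(second_half_losses))
```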

Implications and Diagnostics

Compellingly, the research suggests that even a small fraction of repeated data can result in disproportionately severe performance degradation. This is particularly important for LLMs that risk overfitting high-quality distributions such as Wikipedia. The double-descent phenomenon offers a practical diagnostic for identifying when repeated data is likely harming training. Moreover, the mechanistic interpretations connect that damage to experimentally observed model behaviors and the underlying neural computations, yielding insight into the model's learning dynamics.
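In practice, the diagnostic amounts to watching held-out test loss for a mid-training rise after an initial decline. The sketch below is one assumed way to automate that check; the function name, smoothing window, and tolerance are illustrative choices, not a procedure from the paper:

```python
def flag_double_descent(step_losses, window=5, tolerance=0.01):
    """Flag runs whose held-out test loss rises mid-training after an initial decline,
    the signature of the double-descent bump described above (hypothetical sketch).

    step_losses: list of (training_step, held_out_test_loss) pairs, in order.
    """
    losses = [loss for _, loss in step_losses]
    # Smooth with a simple moving average to ignore step-to-step noise.
    smoothed = []
    for i in range(len(losses)):
        chunk = losses[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))

    best_so_far = float("inf")
    for (step, _), loss in zip(step_losses, smoothed):
        best_so_far = min(best_so_far, loss)
        if loss > best_so_far * (1 + tolerance):
            return step   # first step where smoothed loss rose noticeably above its minimum
    return None           # no mid-training rise detected
```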

Conclusion

In summary, the paper offers a hypothesis for the harms to language model performance caused by repetition in training datasets. It shows that double descent arises as the model overfits the repeated subset of data, and it demonstrates a pronounced adverse effect on induction heads and copying, behaviors associated with generalization. These findings forge a conceptual bridge between raw model performance and detailed mechanistic workings, providing a clearer understanding of how repeated data degrades language models.
