Abstract

Large language models (LLMs) have become state of the art on many benchmarks, and conversational LLM applications such as ChatGPT are now widely used by the public. These LLMs can generate large amounts of content that is posted to various platforms on the internet. Because LLMs are typically trained on datasets collected from the internet, this LLM-generated content may be used to train the next generation of LLMs. A self-consuming training loop therefore emerges in which new LLM generations are trained on the output of previous generations. We empirically study this self-consuming training loop using a novel dataset that lets us analytically and accurately measure the quality and diversity of generated outputs. We find that the self-consuming training loop initially improves both quality and diversity; after a few generations, however, the output inevitably degenerates in diversity. We find that the rate of degeneration depends on the proportion of real and generated data.

Overview

  • Content generated by LLMs is often recycled into subsequent training cycles, creating a self-consuming loop.

  • The researchers used a dataset of logical expressions to assess syntactic and semantic accuracy across generations.

  • Early stages of the loop showed improved output quality and diversity.

  • Content diversity declines over time, affected by the mix of real and synthetic training data.

  • Long-term reliance on LLM-generated data compromises output variety, indicating a need to keep incorporating fresh real data.

Introduction

The environment in which LLMs are trained and subsequently deployed is a dynamic one, closely tied to the vast and ever-evolving content on the internet. A notable phenomenon is how the content generated by LLMs is often recycled back into the data pools from which new LLM generations are trained, leading to what can be described as a "self-consuming training loop". This process raises questions about the long-term effects on the quality and diversity of the output produced by successive LLM generations.
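To make the loop concrete, here is a minimal sketch in Python. The toy train_model and generate_samples helpers and the 50/50 mixing fraction are illustrative assumptions, not the paper's actual training setup; only the structure of the loop matters here.

```python
import random

# Minimal sketch of a self-consuming training loop. The "model" here is a toy
# stand-in that memorizes and resamples strings; real LLM training and
# sampling would take its place.

def train_model(data):
    """Toy 'training': the model simply memorizes its training set."""
    return list(data)

def generate_samples(model, n):
    """Toy 'generation': resample (with replacement) from what the model learned."""
    return [random.choice(model) for _ in range(n)]

def self_consuming_loop(real_data, generations=5, synthetic_fraction=0.5):
    """Train successive model generations on a mix of real and previously generated data."""
    training_data = list(real_data)
    model = None
    for gen in range(generations):
        model = train_model(training_data)
        synthetic = generate_samples(model, n=len(real_data))
        # The next generation sees the real data plus a slice of recycled output.
        k = int(synthetic_fraction * len(real_data))
        training_data = list(real_data) + synthetic[:k]
    return model

self_consuming_loop(["a and not b", "b or c", "not (a or b)", "c and a"])
```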

Analyzing the Self-Consuming Training Loop

To understand the ramifications for LLMs caught in this cycle, the researchers adopted an empirical approach, constructing a novel dataset of logical expressions. Unlike natural language, logical expressions allow for straightforward analytical verification of both syntactic and semantic accuracy, providing a clear measure of the correctness of LLM-generated content.
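The sketch below illustrates why such a dataset is convenient to evaluate. It assumes a hypothetical sample format of "<boolean expression> = <True|False>" written in Python syntax; the paper's actual grammar and format may differ.

```python
import ast

# Only pure boolean expressions (and/or/not over True/False literals) are accepted.
ALLOWED_NODES = (ast.Expression, ast.BoolOp, ast.UnaryOp,
                 ast.And, ast.Or, ast.Not, ast.Constant)

def is_syntactically_valid(expr: str) -> bool:
    """Syntactic check: does the string parse as a pure boolean expression?"""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    return all(isinstance(node, ALLOWED_NODES) for node in ast.walk(tree))

def is_semantically_correct(sample: str) -> bool:
    """Semantic check: does the expression evaluate to the value it claims?"""
    expr, _, claimed = sample.partition("=")
    expr, claimed = expr.strip(), claimed.strip()
    if not is_syntactically_valid(expr) or claimed not in {"True", "False"}:
        return False
    value = eval(compile(ast.parse(expr, mode="eval"), "<expr>", "eval"))
    return value == (claimed == "True")

print(is_semantically_correct("(True and not False) or False = True"))  # True
print(is_semantically_correct("True and (False or = True"))             # False: syntax error
```

Because correctness can be decided mechanically like this, quality and diversity can be tracked exactly over many generations without human evaluation.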

Quality and Diversity Over Generations

The study revealed that this self-consuming training loop does initially enhance the quality and diversity of the outputs. After a few iterations of the cycle, however, the diversity of the content starts to decline, irrespective of the data cycle, that is, the method by which new data is incorporated for each LLM training generation. The rate of this decline in diversity also depends on the mix of real and synthetic data used in training.
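Diversity can be tracked with simple corpus statistics; the distinct-n metric below is one illustrative choice, not necessarily the measure used in the paper.

```python
def distinct_n(samples, n=2):
    """Fraction of unique token n-grams among all n-grams in a set of generated samples."""
    ngrams = []
    for text in samples:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Collapsing diversity shows up as a shrinking score across generations.
gen_1 = ["a and not b", "b or c", "not ( a or b )", "c and a"]   # varied outputs
gen_5 = ["a and b", "a and b", "a and b", "a or b"]              # repetitive outputs
print(distinct_n(gen_1), distinct_n(gen_5))  # the later generation scores lower
```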

Implications and Future Work

A significant implication of the study is that while utilizing LLM-generated data can improve correctness in the short term, it may significantly compromise the variety of outputs over time. Researchers and developers therefore need to examine their training data carefully to avoid a potential decrease in the utility and performance of LLMs. The study suggests further research to explore how the introduction of fresh data in each generation and tactics such as fine-tuning affect such self-consuming training loops.
