
Will we run out of data? Limits of LLM scaling based on human-generated data (2211.04325v2)

Published 26 Oct 2022 in cs.LG, cs.AI, cs.CL, cs.CV, and cs.CY

Abstract: We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.

Citations (74)

Summary

  • The paper demonstrates that exponential dataset growth in ML may soon outpace human-generated data, with high-quality language data projected to be depleted by around 2027.
  • Using historical trends and compute-optimal projections, the study reveals that available computational resources tightly constrain dataset scaling.
  • The findings emphasize the need for enhanced data efficiency and exploring synthetic and multimodal data as strategic alternatives.

Analysis of Dataset Scaling Limits in Machine Learning

The paper "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning" authored by Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho explores the critical issue of dataset scaling in ML, particularly for NLP and computer vision. The paper investigates whether the relentless growth in training dataset sizes can be sustained in the long term, given the finite availability of data. This work utilizes historical growth patterns, compute-optimal dataset size projections, and models of data stock accumulation rates to provide insights into potential bottlenecks.

Key Findings

The authors present several critical projections and insights:

  • Exponential Dataset Growth: Historically, language datasets have grown at a rate exceeding 50% per year, with current sizes reaching approximately 2e12 words as of October 2022. The stock of high-quality language data is projected to be depleted between 2023 and 2027, while low-quality language and image datasets are expected to last until 2030-2050 and 2030-2060, respectively.
  • Data Accumulation Rates: The paper reports that the stock of language data increases at a rate of around 7% per year, but this rate is expected to decelerate to roughly 1% by 2100. Vision data follows a similar trend, currently growing at approximately 8%, which is also projected to slow down to around 1% by 2100.
  • Compute Constraints: Projections based on compute availability indicate a slower growth trajectory for dataset sizes. Compute-optimal dataset sizes show a strong coupling between available computational resources and data requirements, suggesting that how quickly data stocks are exhausted is itself constrained by the growth of computational power. A rough numerical sketch combining these projections follows this list.
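
The sketch below combines the quantities quoted above: it treats demanded dataset size (~2e12 words, growing ~50% per year) and the public text stock (growing ~7% per year) as simple exponentials and estimates when demand would overtake supply, then maps a compute budget to a dataset size using the common Chinchilla-style rule of thumb (C ≈ 6·N·D with D ≈ 20·N). The absolute stock size (1e14 words) and the 1e25 FLOP budget are illustrative placeholders, not figures taken from the paper, and the exponential treatment is a deliberate simplification of the paper's models.

```python
import math

# Quantities quoted in the findings above (as of Oct 2022)
current_dataset_words = 2e12   # ~2e12 words used to train large language models
dataset_growth = 0.50          # ~50% per year historical growth in dataset size
stock_growth = 0.07            # ~7% per year growth of the public text stock

# ILLUSTRATIVE placeholder, not an estimate from the paper
assumed_stock_words = 1e14

def years_until_exhaustion(dataset, stock, demand_growth, supply_growth):
    """Years until dataset demand overtakes the stock, treating both as
    simple exponentials (a simplification of the paper's models)."""
    years = 0
    while dataset < stock and years < 200:
        dataset *= 1 + demand_growth
        stock *= 1 + supply_growth
        years += 1
    return years

years = years_until_exhaustion(current_dataset_words, assumed_stock_words,
                               dataset_growth, stock_growth)
print(f"Demand overtakes the assumed stock after ~{years} years")

def compute_optimal_tokens(flop_budget):
    """Chinchilla-style rule of thumb: C ~ 6*N*D and D ~ 20*N,
    so N = sqrt(C/120) and D = 20*N."""
    n_params = math.sqrt(flop_budget / 120.0)
    return 20.0 * n_params

print(f"Compute-optimal tokens for a 1e25 FLOP budget: "
      f"{compute_optimal_tokens(1e25):.2e}")
```

With these placeholder values the demand curve crosses the stock after roughly a decade, which matches the qualitative conclusion of the paper's more careful stock estimates; changing the assumed stock shifts the crossing year but not the overall shape of the argument.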

Implications and Future Directions

The findings underscore the potential bottlenecks in data availability, which could significantly impact the advancement of ML models:

  1. Data Efficiency: A crucial takeaway from the paper is the necessity for improvements in data efficiency. Techniques that maximize the utility of smaller datasets or enhance the quality of data will be vital to bypass the impending scarcity.
  2. Synthetic Data: Using synthetic data to supplement or replace human-generated data emerges as a practical option, though its efficacy and the cost of generating it at scale remain uncertain (a minimal illustrative sketch follows this list).
  3. Multimodal Models: The integration of multimodal models could leverage different types of data, such as combining text and image data, to mitigate the limitations imposed by single-modality data stocks.
  4. Economic and Technical Factors: Large-scale economic or technical changes, such as advancements in data generation technologies, wide-scale data collection initiatives, or shifts in the data landscape due to new applications, could alter the projections.
  5. Algorithmic Innovations: Future algorithmic breakthroughs in training efficiency or data augmentation could extend the lifespan of existing data stocks. Additionally, refined methodologies for high-quality data extraction from low-quality sources may also delay data exhaustion.
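
As a concrete illustration of point 2, the sketch below shows the generic shape of a synthetic-data pipeline: a generator model produces candidate text, a quality filter keeps a subset, and the survivors are added to the training pool. The names (`generate_text`, `passes_quality_filter`, `build_synthetic_corpus`) are hypothetical placeholders, not APIs from the paper or any particular library.

```python
from typing import Callable, List

def build_synthetic_corpus(
    generate_text: Callable[[str], str],       # hypothetical generator call
    passes_quality_filter: Callable[[str], bool],  # e.g. dedup or perplexity cutoff
    prompts: List[str],
    target_size: int,
    max_attempts: int = 100_000,
) -> List[str]:
    """Collect filtered synthetic samples until target_size is reached
    or the attempt budget runs out."""
    corpus: List[str] = []
    if not prompts:
        return corpus
    for attempt in range(max_attempts):
        if len(corpus) >= target_size:
            break
        prompt = prompts[attempt % len(prompts)]  # cycle through seed prompts
        sample = generate_text(prompt)            # model-generated candidate text
        if passes_quality_filter(sample):         # keep only passing samples
            corpus.append(sample)
    return corpus
```

The filtering step is where the uncertainty noted in point 2 lives: how much usable training signal survives filtering, and at what compute cost, determines whether such a loop actually extends the effective data stock.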

Conclusion

The paper presents a comprehensive analysis of dataset scaling limits, highlighting that the exponential growth of training datasets in ML might soon hit a ceiling due to finite data availability. This projection emphasizes the need for innovations in data efficiency and the exploration of new data sources. While the exhaustion of high-quality language data appears imminent under these projections, the broader implications for AI development call for strategic measures to ensure sustained progress. The analysis lays the groundwork for further research into mitigating the potential data bottleneck in ML.
