To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

Published 22 May 2023 in cs.LG, cs.AI, and cs.CL | (2305.13230v2)

Abstract: Recent research has highlighted the importance of dataset size in scaling LLMs. However, LLMs are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives, while less influential factors consist of dataset quality and model FLOPs. Finally, we explore whether widely used regularization can alleviate multi-epoch degradation. Most regularization techniques do not yield significant improvements, except for dropout, which demonstrates remarkable effectiveness but requires careful tuning when scaling up the model size. Additionally, we discover that leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.



Summary

The paper demonstrates that repeated pre-training data leads to overfitting and multi-epoch degradation, particularly in larger LLMs.
It reveals that increasing dataset size and applying dropout regularization effectively mitigate performance declines.
The study showcases Mixture-of-Experts models as a cost-effective proxy for tuning hyperparameters in dense LLM architectures.

Insights from Scaling LLMs under Token-Crisis

The paper "To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis" presents an empirical investigation of training LLMs when token availability is the bottleneck, a situation the authors call the "token-crisis." The study asks how LLM performance can be maintained when the supply of pre-training data no longer grows fast enough to meet the models' appetite for tokens.

The authors explore the simple yet contentious practice of repeating pre-training data for multiple epochs to extend LLM training. LLMs have traditionally consumed vast amounts of high-quality text from the internet during pre-training, but recent trends indicate that this source may be reaching its limits. The study systematically investigates how repeating pre-training data affects model performance and explores strategies for mitigating the resulting loss in quality, which the authors call multi-epoch degradation.
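To make the notion of repetition concrete, the sketch below computes how many passes over a dataset a fixed training token budget implies. The function name and the numbers are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def epochs_needed(token_budget: int, unique_tokens: int) -> float:
    """Return how many passes over the dataset a token budget implies."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return token_budget / unique_tokens


# Hypothetical numbers, for illustration only.
budget = 400_000_000_000   # tokens the training run will consume
dataset = 100_000_000_000  # unique tokens available
print(epochs_needed(budget, dataset))  # 4.0
```

Any value above 1.0 means the run must see at least some data more than once, which is the regime the study examines.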

Key Findings

Consequences of Data Repetition: Repeating pre-training data can lead to substantial overfitting, particularly when data is scarce. This multi-epoch degradation suggests that additional epochs of training do not contribute positively beyond a certain point. Larger models were found to be more susceptible to this issue than smaller counterparts.
Factors Influencing Degradation: The study identifies dataset size as a crucial factor in alleviating multi-epoch degradation. Larger datasets can mitigate performance degradation more effectively than improvements in dataset quality alone, contradicting some assumptions in the community surrounding the efficacy of high-quality data.
Regularization Techniques: Among the regularization techniques tested, dropout was the most effective at counteracting overfitting, although it requires careful tuning as model size grows. The study suggests applying dropout only after a certain number of epochs, which balances learning efficiency against overfitting mitigation.
Predictive Potential of MoE Models: Mixture-of-Experts (MoE) models can predict the behavior of dense models with comparable trainable parameters, which makes them a computationally cheaper proxy for hyper-parameter tuning. Tuning on the MoE model first offers significant cost savings, since it yields insights without the expense of running the same search on large dense models like GPT-3.
Training Objectives and Overfitting: Diverse objectives, representing mixtures of traditional goals like masked language modeling and next-token prediction, were evaluated. The UL2 framework appeared more prone to rapid learning and memorization, exacerbating degradation under constrained data conditions.
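The idea of enabling dropout only after a certain number of epochs can be sketched as a simple schedule. This is a minimal illustration under stated assumptions: the function name, the switch epoch, and the rate of 0.1 are hypothetical and are not the paper's settings.

```python
def dropout_for_epoch(epoch: int, start_epoch: int, rate: float = 0.1) -> float:
    """Return the dropout rate to use in a given epoch.

    Dropout is disabled (0.0) before `start_epoch` and set to `rate`
    from `start_epoch` onwards. All values here are illustrative.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    return rate if epoch >= start_epoch else 0.0


# Hypothetical schedule: no dropout for epochs 0-1, dropout from epoch 2.
schedule = [dropout_for_epoch(e, start_epoch=2) for e in range(4)]
print(schedule)  # [0.0, 0.0, 0.1, 0.1]
```

In a real training loop the returned value would be passed to the model's dropout layers at the start of each epoch; how that is done depends on the framework and is left out here.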

Implications and Future Directions

The authors present a compelling case for reevaluating pre-training approaches under resource constraints. The exploration of dropout and MoE models paves the way for more resource-efficient LLM developments. The identified efficacy of dropout challenges existing norms in large model training strategies, suggesting revised procedures for introducing such regularizations.

The potential applications of these insights extend to enhancing the accessibility of LLMs across languages beyond English, where data scarcity is even more pronounced. Future research should address how different architectures and objectives can be optimized for efficient learning with limited high-quality data.

Ultimately, this paper underscores the importance of adaptability in LLM training protocols as limits on data availability draw closer. By applying regularization such as dropout and using MoE models as proxies for hyper-parameter tuning, practitioners can continue to extract value from LLMs under the pressures of data saturation and compute constraints. The research points toward a balanced approach to model scaling and data utilization, which is critical for sustainable advances in AI and natural language processing.
