D4: Improving LLM Pretraining via Document De-Duplication and Diversification

(2308.12284)
Published Aug 23, 2023 in cs.CL, cs.AI, and cs.LG

Abstract

Over recent years, an increasing amount of compute and data has been poured into training LLMs, usually by doing one-pass learning on as many tokens as possible, randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training; they call into question the common practice of training for a single epoch on as much data as possible and demonstrate a path to keep improving our models past the limits of randomly sampling web data.

D4-based data selection improves the training efficiency and downstream NLP accuracy of OPT models compared to random data selection.

Overview

  • The D4 (Document De-Duplication and Diversification) strategy is proposed to optimize the selection of training data for LLMs by reducing redundancy, leading to efficiency gains in pretraining.

  • D4 employs semantic deduplication to refine training datasets beyond what existing methods such as MinHash achieve, yielding improved training efficiency and performance on downstream NLP tasks.

  • The strategy shows robust results across different training regimes, highlighting its utility in both compute-limited and data-limited scenarios, and suggesting a path toward more sustainable and cost-effective LLM training.

  • The paper outlines theoretical advancements in data curation techniques and practical applications for LLM training, proposing future research directions including combination with other data sources, scaling, and embedding space exploration.

Improving Pretraining of LLMs with the D4 Strategy: A Focus on Deduplication and Diversification

Introduction to D4 Strategy

In this paper, we propose a novel data selection strategy, D4 (Document De-Duplication and Diversification), aimed at enhancing the pretraining phase of LLMs by optimizing the selection of training data. Conventional approaches to pretraining LLMs rely heavily on sourcing vast amounts of web data, which leads to the unintended inclusion of duplicated and semantically redundant content. This not only bloats the dataset but can also degrade the model's performance. D4 addresses the issue by systematically reducing redundancy, improving the quality of the training data in a way that translates directly into pretraining efficiency gains and better downstream task performance.
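
To make the two stages concrete, here is a minimal, illustrative sketch of a D4-style selection flow in Python: embed every document with a pretrained encoder, remove semantic near-duplicates inside k-means clusters, then diversify by dropping the most prototypical points (those closest to their cluster centroid). The stand-in embedding function, cluster counts, similarity threshold, and keep fraction below are placeholder assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative sketch of a D4-style selection pipeline (not the authors' code):
#   1) embed documents,
#   2) semantic de-duplication within k-means clusters,
#   3) diversification by dropping the most "prototypical" points near centroids.
# The embedding stand-in, cluster counts, and thresholds are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def embed(docs):
    """Placeholder: one unit-normalized embedding per document.
    A real pipeline would use a small pretrained LM or sentence encoder."""
    rng = np.random.default_rng(0)
    return normalize(rng.normal(size=(len(docs), 128)))


def semantic_dedup(emb, n_clusters=10, sim_threshold=0.95):
    """Within each k-means cluster, keep one document per group of
    near-duplicates (pairwise cosine similarity above sim_threshold)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        sims = emb[idx] @ emb[idx].T  # cosine similarities (rows are unit norm)
        removed = set()
        for i in range(len(idx)):
            if i in removed:
                continue
            keep.append(idx[i])
            removed.update(j for j in range(i + 1, len(idx)) if sims[i, j] > sim_threshold)
    return np.array(sorted(keep))


def diversify(emb, keep_fraction=0.75, n_clusters=10):
    """Drop the most prototypical points (closest to their cluster centroid),
    keeping the more diverse remainder."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(emb)
    dist = np.linalg.norm(emb - km.cluster_centers_[km.labels_], axis=1)
    n_keep = int(keep_fraction * len(emb))
    return np.argsort(dist)[-n_keep:]  # farthest from centroid = least prototypical


docs = [f"document {i}" for i in range(1000)]
emb = embed(docs)
dedup_idx = semantic_dedup(emb)                 # stage 1: drop semantic near-duplicates
div_idx = dedup_idx[diversify(emb[dedup_idx])]  # stage 2: diversify what remains
selected_docs = [docs[i] for i in div_idx]
```

In practice the embedding step would use a small pretrained language model rather than random vectors, and both stages would run over far larger clusterings of a web-scale corpus.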

Our Contribution and Results

  • Data Selection Strategies: We explore various data selection strategies that mitigate redundancy in training datasets. While methods such as MinHash are already widely used for de-duplication, our exploration suggests that further gains are available. Specifically, we propose D4, which layers semantic deduplication on top of such surface-level filtering to refine the training dataset further (a minimal MinHash sketch follows this list).
  • Efficiency Gains: Implementing D4 on a 6.7-billion-parameter model trained on 100 billion tokens yields an efficiency gain of about 20%, meaning the model reaches the baseline's validation perplexity in correspondingly fewer training steps. Moreover, D4 improves average 0-shot downstream task accuracy by approximately 2% across 16 NLP tasks, suggesting that it not only speeds up training but may also improve model generalization.
  • Application Across Regimes: Our findings hold in both compute-limited and data-limited training regimes. Notably, when data becomes scarce, intelligently choosing which data to repeat during training (a practice D4 facilitates) yields better performance than the traditional approaches of adding new data or repeating randomly chosen data.
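
Since D4's semantic stage is applied on top of standard surface-level de-duplication, the sketch below (referenced in the first bullet above) shows what that MinHash baseline looks like, here using the datasketch library. The shingle size, permutation count, and similarity threshold are arbitrary illustrative choices, not values from the paper.

```python
# Minimal sketch of MinHash-based near-duplicate detection, the surface-level
# de-duplication baseline that D4's semantic stage builds on.
# Shingle size, num_perm, and the LSH threshold are illustrative choices.

from datasketch import MinHash, MinHashLSH


def minhash_signature(text, num_perm=128, shingle_size=3):
    """Build a MinHash signature from word n-gram shingles of a document."""
    words = text.split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf8"))
    return m


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bend",
    "c": "completely unrelated text about large language model pretraining data",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)   # flag pairs with high estimated Jaccard
signatures = {k: minhash_signature(v) for k, v in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

# Keep one representative per group of near-duplicate documents.
kept, seen = [], set()
for key, sig in signatures.items():
    if key in seen:
        continue
    kept.append(key)
    seen.update(lsh.query(sig))                 # the doc itself plus its near-duplicates
print(kept)                                     # likely ["a", "c"]; "b" is a near-duplicate of "a"
```

Semantic deduplication then goes further, catching documents that share meaning but little verbatim text, which token-level signatures like these cannot detect.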

Theoretical and Practical Implications

The paper's contributions are significant for both theoretical understanding and practical applications of LLM training. Theoretically, it presents an advancement in data curation techniques, suggesting the importance of intelligent data selection for model efficiency and effectiveness. Practically, it outlines a viable path toward more sustainable and cost-effective training of LLMs, which is crucial given the escalating computational demands associated with these models.

Future Directions

This research opens several avenues for future exploration:

  • Combination with other data sources: Integrating D4 with a diverse mix of data sources could further enhance training datasets' quality and model robustness.
  • Scaling to larger models: Investigating the impacts of D4 on models larger than those tested could provide insights into its scalability and potential limits.
  • Embedding space exploration: Further exploration of different embedding spaces could yield even better strategies for deduplication and diversification, enhancing the overall efficiency of D4.

Conclusion

The D4 strategy represents a significant step forward in optimizing data selection for pretraining LLMs. By removing duplicated and semantically redundant content from training datasets, D4 enables more efficient model training without compromising model performance, and indeed improves it. This research sets the stage for more nuanced and intelligent approaches to data curation, with the promise of making LLM pretraining more efficient and effective.
