The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
DoReMi is a novel method for optimizing language model (LM) pretraining by adjusting the mixture proportions of pretraining data domains without prior knowledge of downstream tasks.
The method leverages Group Distributionally Robust Optimization to dynamically modify domain weights based on each domain's excess loss, prioritizing domains where the proxy model lags behind a reference model.
Empirical results on The Pile and GLaM datasets show that DoReMi reduces perplexity, improves downstream accuracy, and reaches baseline accuracy with fewer training steps and less compute.
DoReMi's approach provides theoretical and practical insights for improving LM training efficiency and performance, suggesting potential future research directions for further optimization.
Training data composition significantly influences the performance of language models (LMs). This paper introduces Domain Reweighting with Minimax Optimization (DoReMi), a method that optimizes the mixture proportions of pretraining data domains to enhance LM performance across a broad range of tasks. DoReMi adjusts domain weights (mixture proportions) without requiring knowledge of downstream task specifics, streamlining LM pretraining. The method is validated through experiments on The Pile and the GLaM dataset, which show that it improves perplexity, hastens convergence, and matches or outperforms models trained with downstream task-tuned domain weights, all at a fraction of the computational overhead typically entailed.
DoReMi's procedure begins by training a small reference model using initial reference domain weights, which can be chosen heuristically or set proportional to domain sizes. It then trains a proxy model with Group Distributionally Robust Optimization (Group DRO) over domains. Unlike typical DRO pipelines, which use the robust model directly, DoReMi discards the proxy model and instead extracts the optimized domain weights, which are used to resample the pretraining data and train the larger, full-sized model.
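The final step, resampling the pretraining data according to the learned domain weights, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, dataset structure, and sampling granularity are assumptions for the example.

```python
import random

def sample_domain_batch(domain_datasets, domain_weights, batch_size, rng=random):
    """Draw a training batch by first picking a domain according to the
    learned domain weights, then drawing an example from that domain.

    domain_datasets: dict mapping domain name -> list of examples (illustrative)
    domain_weights:  dict mapping domain name -> sampling weight
    """
    domains = list(domain_datasets)
    weights = [domain_weights[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        # Pick a domain proportionally to its weight, then an example from it.
        d = rng.choices(domains, weights=weights)[0]
        batch.append(rng.choice(domain_datasets[d]))
    return batch

# Example usage with toy data and hypothetical DoReMi-style weights.
rng = random.Random(0)
datasets = {"wiki": ["w1", "w2"], "web": ["x1", "x2", "x3"], "books": ["b1"]}
weights = {"wiki": 0.5, "web": 0.3, "books": 0.2}
batch = sample_domain_batch(datasets, weights, batch_size=8, rng=rng)
```

In practice the resampling would operate over tokenized shards rather than in-memory lists, but the domain-first sampling structure is the same.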
Central to DoReMi is its dynamic adjustment of domain weights based on each domain's excess loss: the gap between the proxy model's loss and the reference model's loss. This adaptation focuses training on domains where learning lags, promoting a balanced performance uplift across all domains rather than overfitting to a specific subset.
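The excess-loss-driven reweighting can be sketched as an exponentiated-gradient update on the domain weights, in the spirit of Group DRO. This is a simplified sketch under stated assumptions, not the paper's exact algorithm; the function name and hyperparameter values are illustrative.

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    """One Group DRO-style update: upweight domains where the proxy model's
    loss exceeds the reference model's loss (the excess loss)."""
    # Excess loss per domain, clipped at zero so only lagging domains gain weight.
    excess = np.maximum(proxy_losses - ref_losses, 0.0)
    # Exponentiated-gradient ascent on the weights (multiplicative update).
    logits = np.log(weights) + step_size * excess
    new_weights = np.exp(logits - logits.max())  # subtract max for stability
    new_weights /= new_weights.sum()
    # Mix with the uniform distribution to keep every domain's weight nonzero.
    k = len(weights)
    return (1 - smoothing) * new_weights + smoothing / k

# Toy example with three domains and made-up losses.
weights = np.full(3, 1 / 3)
proxy_losses = np.array([2.0, 1.5, 1.0])
ref_losses = np.array([1.5, 1.5, 1.2])
new_w = update_domain_weights(weights, proxy_losses, ref_losses)
# Domain 0 has the only positive excess loss, so its weight rises above 1/3.
```

Averaging these per-step weights over training then yields the final domain weights used to retrain the full-sized model.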
Empirically, DoReMi improves average downstream accuracy and reaches the baseline's accuracy with considerably fewer training steps across the diverse domains of The Pile and GLaM datasets.
This research has multifaceted implications for the broader LM and AI community. Theoretically, it elucidates the significant impact of data domain proportions on model performance and presents a robust method to optimize these proportions in a principled manner. Practically, DoReMi offers a tangible strategy to enhance the training efficiency of large-scale LMs without the prohibitive computational cost often associated with domain weight optimization using downstream tasks.
The paper outlines several avenues for further research, including exploring the effects of varying proxy model sizes and the feasibility of transferring domain weights across different model sizes. Additionally, iterated DoReMi, which refines domain weights through successive rounds of the procedure, is positioned as a potentially fruitful area for exploration.
DoReMi marks a significant step forward in data-centric approaches to LM training, underscoring the critical role of training data composition. By efficiently optimizing domain weights in a task-agnostic manner, it promises enhanced LM performance and training efficiency—key metrics in the rapidly advancing field of generative AI and LLMs.