The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
DoReMi is a novel method for optimizing language model (LM) pretraining by adjusting the mixture proportions of pretraining data domains without prior knowledge of downstream tasks.
The method leverages Group Distributionally Robust Optimization to dynamically modify domain weights based on each domain's excess loss, prioritizing domains where the proxy model lags behind a reference model.
Empirical results on The Pile and GLaM datasets show that DoReMi reduces perplexity, improves downstream accuracy, and reaches baseline accuracy with fewer training steps and less compute.
DoReMi's approach provides theoretical and practical insights for improving LM training efficiency and performance, suggesting potential future research directions for further optimization.
Training data composition significantly influences the performance of language models (LMs). This paper introduces Domain Reweighting with Minimax Optimization (DoReMi), a method that optimizes the mixture proportions of pretraining data domains to enhance LM performance across a broad range of tasks. DoReMi adjusts domain weights (mixture proportions) without requiring knowledge of downstream task specifics, streamlining LM pretraining. The method is validated through experiments on The Pile and the GLaM dataset, which show that it improves perplexity, hastens convergence, and matches or outperforms models trained with downstream task-tuned domain weights, all at a fraction of the computational overhead typically entailed.
DoReMi's procedure begins by training a small reference model using initial reference domain weights, which can be chosen heuristically or set proportional to domain sizes. It then trains a proxy model with Group Distributionally Robust Optimization (Group DRO) over domains. Unlike typical DRO pipelines, which use the robust model directly, DoReMi discards the proxy model and instead extracts the optimized domain weights, which are used to resample the pretraining data and train the larger, full-sized model.
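The final step, resampling the pretraining data according to the learned domain weights, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, dataset structure, and sampling granularity are assumptions for the example.

```python
import random

def sample_domain_batch(domain_datasets, domain_weights, batch_size, rng=random):
    """Draw a training batch by first picking a domain according to the
    learned domain weights, then drawing an example from that domain.

    domain_datasets: dict mapping domain name -> list of examples (illustrative)
    domain_weights:  dict mapping domain name -> sampling weight
    """
    domains = list(domain_datasets)
    weights = [domain_weights[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        # Pick a domain proportionally to its weight, then an example from it.
        d = rng.choices(domains, weights=weights)[0]
        batch.append(rng.choice(domain_datasets[d]))
    return batch

# Example usage with toy data and hypothetical DoReMi-style weights.
rng = random.Random(0)
datasets = {"wiki": ["w1", "w2"], "web": ["x1", "x2", "x3"], "books": ["b1"]}
weights = {"wiki": 0.5, "web": 0.3, "books": 0.2}
batch = sample_domain_batch(datasets, weights, batch_size=8, rng=rng)
```

In practice the resampling would operate over tokenized shards rather than in-memory lists, but the domain-first sampling structure is the same.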
Central to DoReMi is its dynamic adjustment of domain weights based on each domain's excess loss: the gap between the proxy model's loss and the reference model's loss. This adaptation focuses training on domains where learning lags, promoting a balanced performance uplift across all domains rather than overfitting to a specific subset.
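The excess-loss-driven reweighting can be sketched as an exponentiated-gradient update on the domain weights, in the spirit of Group DRO. This is a simplified sketch under stated assumptions, not the paper's exact algorithm; the function name and hyperparameter values are illustrative.

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses,
                          step_size=1.0, smoothing=1e-3):
    """One Group DRO-style update: upweight domains where the proxy model's
    loss exceeds the reference model's loss (the excess loss)."""
    # Excess loss per domain, clipped at zero so only lagging domains gain weight.
    excess = np.maximum(proxy_losses - ref_losses, 0.0)
    # Exponentiated-gradient ascent on the weights (multiplicative update).
    logits = np.log(weights) + step_size * excess
    new_weights = np.exp(logits - logits.max())  # subtract max for stability
    new_weights /= new_weights.sum()
    # Mix with the uniform distribution to keep every domain's weight nonzero.
    k = len(weights)
    return (1 - smoothing) * new_weights + smoothing / k

# Toy example with three domains and made-up losses.
weights = np.full(3, 1 / 3)
proxy_losses = np.array([2.0, 1.5, 1.0])
ref_losses = np.array([1.5, 1.5, 1.2])
new_w = update_domain_weights(weights, proxy_losses, ref_losses)
# Domain 0 has the only positive excess loss, so its weight rises above 1/3.
```

Averaging these per-step weights over training then yields the final domain weights used to retrain the full-sized model.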
Empirically, DoReMi improves average downstream accuracy and reaches the baseline's accuracy with considerably fewer training steps across the diverse domains of The Pile and GLaM datasets.
This research has multifaceted implications for the broader LM and AI community. Theoretically, it elucidates the significant impact of data domain proportions on model performance and presents a robust method to optimize these proportions in a principled manner. Practically, DoReMi offers a tangible strategy to enhance the training efficiency of large-scale LMs without the prohibitive computational cost often associated with domain weight optimization using downstream tasks.
The paper outlines several avenues for further research, including exploring the effects of varying proxy model sizes and the feasibility of transferring domain weights across different model sizes. Additionally, iterated DoReMi, which refines domain weights through successive rounds of the procedure, is positioned as a potentially fruitful area for exploration.
DoReMi marks a significant step forward in data-centric approaches to LM training, underscoring the critical role of training data composition. By efficiently optimizing domain weights in a task-agnostic manner, it promises enhanced LM performance and training efficiency—key metrics in the rapidly advancing field of generative AI and LLMs.