Abstract

Pretraining datasets for LLMs have grown to trillions of tokens, composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities, as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity of general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve by up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B), a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% is optimal for navigating the trade-off between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.

Figure: Impact of domain upsampling duration on model performance across different benchmarks and token counts.

Overview

  • The paper introduces 'domain upsampling,' a technique that strategically increases the representation of domain-specific datasets towards the end of training, leading to significant performance improvements on specific benchmarks for LLMs.

  • An extensive ablation study was conducted to find the optimal duration for domain upsampling, revealing that upsampling during the final 10%-20% of training achieves the best results without compromising the model's general capabilities.

  • The approach not only enhances targeted benchmark performances but also provides insights into the utility of individual datasets in model training, thereby offering a more cost-effective method for enhancing LLMs.

Performance Gains from Domain Upsampling at the End of Training: A Summary

LLMs often rely on diverse pretraining datasets to ensure robust performance across various benchmarks. These datasets typically include vast amounts of CommonCrawl (CC) web data complemented by smaller, domain-specific sources. Optimizing the balance between these diverse data sources, however, is both expensive and computationally intensive, particularly at large FLOP scales. In the discussed paper, the authors introduce the concept of "domain upsampling" at the end of training, a technique that strategically increases the representation of domain-specific datasets to enhance model performance on specific benchmarks while maintaining general capabilities.

Key Contributions

The authors present several notable contributions:

  1. Baseline Data Mix Construction: A baseline data mix was devised using publicly available datasets, structured into four broad categories: Large-Scale CC, Small-Scale CC, Domain Specific data, and Code datasets. These proportions were chosen heuristically to maintain a balance between information density and diversity.
  2. Implementation of Domain Upsampling: Domain upsampling was introduced as a pretraining intervention in which domain-specific datasets are upsampled during the final stages of training (see the sketch after this list). This method yielded improvements of up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval for a 7B model.
  3. Ablation Study on Duration: An extensive ablation study examined the duration of domain upsampling from 5% to 30% of the training period. Optimal results were achieved with 10%-20% upsampling, carefully balancing general language modeling capabilities and performance on targeted benchmarks.
  4. Characterization of Dataset Utility: The paper utilized domain upsampling as a cost-effective method to characterize the utility of individual datasets on model performance. For instance, removing math-heavy data subsets during the upsampling phase provided insights into their contribution to benchmarks like GSM8K.
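
The paper's exact mixture weights are not reproduced here; the following is a minimal sketch, assuming hypothetical category weights and an illustrative upsampling factor, of how a two-phase data mix of this kind could be expressed.

```python
# Hypothetical baseline mixture weights (not the paper's actual values),
# grouped into the four broad categories described above.
BASELINE_MIX = {
    "large_scale_cc": 0.60,
    "small_scale_cc": 0.20,
    "domain_specific": 0.10,
    "code": 0.10,
}

def upsampled_mix(baseline, upsample_factor=4.0,
                  upsampled=frozenset({"domain_specific", "code"})):
    """Scale the chosen categories by `upsample_factor`, then renormalize
    so the weights sum to 1. The factor is an illustrative choice."""
    raw = {k: w * (upsample_factor if k in upsampled else 1.0)
           for k, w in baseline.items()}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()}

def mixture_for_step(step, total_steps, upsample_fraction=0.20):
    """Use the baseline mix for the first (1 - upsample_fraction) of
    training and the upsampled mix for the final upsample_fraction."""
    if step < (1.0 - upsample_fraction) * total_steps:
        return BASELINE_MIX
    return upsampled_mix(BASELINE_MIX)

# Example: weights at 90% of a 100k-step run (inside the final 20%).
print(mixture_for_step(step=90_000, total_steps=100_000))
```

A dataloader would then draw each example's source category according to whichever set of weights applies at the current step.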

Training Details

The experiments were conducted on 7 billion parameter models trained for 1 trillion tokens, using the MPT architecture. Key hyperparameters included the LionW optimizer, a learning rate of 0.00012, and an inverse square root learning schedule. Evaluations were conducted using the Gauntlet v0.3, which aggregates performance across 35 popular in-context learning tasks.
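
The schedule itself is standard; as an illustration only, here is a minimal sketch of an inverse square root learning rate schedule with linear warmup. The peak rate matches the 0.00012 quoted above, while the warmup length is an assumed value, not a number taken from the paper.

```python
def inv_sqrt_lr(step, peak_lr=1.2e-4, warmup_steps=2000):
    """Inverse square root schedule with linear warmup.

    peak_lr matches the 0.00012 quoted above; warmup_steps is an
    assumed value for illustration.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # After warmup, decay proportionally to 1 / sqrt(step / warmup_steps).
    return peak_lr * (warmup_steps / step) ** 0.5

# Example: learning rate at a few points in training.
for s in (1_000, 2_000, 10_000, 100_000):
    print(s, f"{inv_sqrt_lr(s):.2e}")
```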

Results

Baseline Model Performance

The baseline data mix demonstrated competitive performance relative to Llama-2 models, with error rates on MMLU, GSM8K, HumanEval, and the Gauntlet v0.3 Core Average all lying on or below the scaling line of the Llama-2 models. Notably, the baseline model outperformed the Llama-2 7B model on GSM8K and HumanEval despite being trained on half the number of tokens (1T vs. 2T).

Impact of Domain Upsampling

Domain upsampling applied during the final 20% of training notably improved model performance across challenging benchmarks, achieving scores competitive with Llama-2 (7B) but with approximately half the training FLOPs. This intervention particularly boosted scores on GSM8K and HumanEval by 8.26 pp and 6.17 pp, respectively.
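
The "half the training FLOPs" comparison follows directly from the common C ≈ 6ND estimate for dense transformer training compute; this is a standard approximation, not a figure reported in the paper.

```python
def approx_training_flops(params, tokens):
    """Standard dense-transformer estimate: C ≈ 6 * N * D."""
    return 6 * params * tokens

ours = approx_training_flops(7e9, 1e12)    # 7B params, 1T tokens
llama2 = approx_training_flops(7e9, 2e12)  # 7B params, 2T tokens
print(f"{ours:.2e} vs {llama2:.2e} FLOPs -> ratio {ours / llama2:.1f}")
# ~4.2e22 vs ~8.4e22 FLOPs -> ratio 0.5
```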

Ablation Study

Ablating the duration of domain upsampling revealed that extending the upsampling phase beyond 20% of training could further improve specific tasks but introduced trade-offs in general language modeling capabilities. Durations of 10%-20% provided the best balance, enhancing targeted benchmarks without compromising other domains.

Dataset Utility Characterization

Domain upsampling also proved useful in isolating the impact of individual datasets. For example, removing math-related datasets during the upsampling phase resulted in lower performance on math and reasoning benchmarks, validating their crucial role in these areas.
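
A minimal sketch of this leave-one-out style characterization is shown below; the dataset names and mixture weights are hypothetical, and the point is only that each ablation changes the final-phase mixture rather than rerunning pretraining from scratch.

```python
# Hypothetical final-phase mixture weights; names and values are
# illustrative, not the paper's actual domain-upsampled mix.
UPSAMPLED_MIX = {
    "math_heavy_web": 0.15,
    "code": 0.30,
    "encyclopedic": 0.25,
    "large_scale_cc": 0.30,
}

def mix_without(mix, dropped):
    """Remove one dataset from the mixture and renormalize the rest,
    giving the data mix for a leave-one-out ablation run."""
    kept = {k: w for k, w in mix.items() if k != dropped}
    total = sum(kept.values())
    return {k: w / total for k, w in kept.items()}

# Each ablation restarts from the same pre-upsampling checkpoint and
# only retrains the final phase, so its cost is a fraction of a full run.
for dataset in UPSAMPLED_MIX:
    print(f"drop {dataset}: {mix_without(UPSAMPLED_MIX, dataset)}")
```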

Implications and Future Directions

The implications of this research are multifaceted:

  • Cost Efficiency: Domain upsampling offers a cost-effective approach to enhance model performance on targeted benchmarks, potentially steering future experimentation with pretraining datasets.
  • Dataset Characterization: This technique allows researchers to isolate and understand the impact of specific datasets on model capabilities, guiding more informed dataset selection in pretraining.
  • Optimizing Pretraining Mixtures: The findings suggest that domain upsampling can navigate the trade-off between general-purpose capabilities and domain-specific improvements, offering a scalable strategy for pretraining LLMs.

Future work could refine the domain upsampling proportions, further exploiting their potential to enhance performance across broader benchmarks. Additionally, integrating domain upsampling with alternative dataset optimization algorithms could expand its utility in pretraining LLMs at even larger scales.

By ensuring comprehensive yet cost-effective improvements in LLM pretraining, domain upsampling represents a significant step forward in the strategic utilization of diverse data sources, yielding models with enhanced performance across diverse tasks.
