
Pre-training Small Base LMs with Fewer Tokens

(arXiv:2404.08634)
Published Apr 12, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

We study the effectiveness of a simple approach to develop a small base language model (LM) starting from an existing large base LM: first inherit a few transformer blocks from the larger LM, and then train this smaller model on a very small subset (0.1%) of the raw pretraining data of the larger model. We call our simple recipe Inheritune and first demonstrate it for building a small base LM with 1.5B parameters using 1B tokens (and the starting few layers of a larger LM of 3B parameters); we do this using a single A6000 GPU for less than half a day. Across 9 diverse evaluation datasets as well as the MMLU benchmark, the resulting model compares favorably to publicly available base models of 1B-2B size, some of which have been trained using 50-1000 times more tokens. We investigate Inheritune in a slightly different setting where we train small LMs utilizing larger LMs and their full pre-training dataset. Here we show that smaller LMs trained utilizing some of the layers of GPT-2 medium (355M) and GPT-2 large (770M) can effectively match the validation loss of their bigger counterparts when trained from scratch for the same number of training steps on the OpenWebText dataset with 9B tokens. We analyze our recipe with extensive experiments and demonstrate its efficacy in diverse settings. Our code is available at https://github.com/sanyalsunny111/LLM-Inheritune.

"Inheritune methodology crafts multiple performant small base LMs from a large base LM, measured by MMLU scores."

Overview

  • Inheritune is introduced as a novel method for pre-training small base language models (LMs) by transferring a subset of transformer blocks from larger models, significantly reducing the computational resources required.

  • The methodology enables the development of compact models that achieve up to 89% of the downstream accuracy of their larger counterparts, while being trained on only 0.1% of the original data and utilizing a single GPU.

  • Scalability tests of Inheritune across different model sizes show a positive correlation between the number of inherited transformer layers and performance on the MMLU benchmark, demonstrating its adaptability.

  • Extended analysis with up to 50B tokens and larger reference models up to 7B parameters demonstrates the method's effectiveness and improved performance in various scenarios.

Exploring Efficient Pre-Training Methods for Smaller Language Models with Inheritune

Introduction

This paper proposes a simple method for pre-training small base language models (LMs), termed Inheritune, which leverages a subset of transformer blocks from a larger LM and trains the resulting smaller model on a fraction of the original pre-training data. The paper examines the potential of Inheritune for developing compact yet effective LMs under limited computational resources. It reports experiments that use only 0.1% of the larger base model's training data and a markedly reduced training time on a single GPU. The resulting small LM performs competitively on multiple evaluation datasets and benchmarks, comparing favorably with base models of similar or larger sizes pre-trained from scratch on far larger datasets.

Method: Inheritune

Inheritune offers an efficient approach for crafting smaller base LMs from larger reference models when only a small portion of the pre-training data is publicly available. The key steps are to inherit the first few transformer layers of a larger pre-trained model and then to continue training the smaller model on a much smaller dataset, which substantially reduces compute and data requirements. The paper details implementations of Inheritune with various reference models and data regimes, showing its versatility and effectiveness across different settings.
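
As a concrete illustration, the sketch below shows how the initialization step might look for a Llama-style reference model using Hugging Face Transformers. The checkpoint name, the number of inherited blocks, and the attribute paths (`model.model.layers`, `model.model.embed_tokens`, and so on) are assumptions for a Llama-style architecture, not the authors' exact implementation; continued next-token pre-training on the small token budget would follow as usual.

```python
import copy
from transformers import AutoModelForCausalLM

REF_MODEL = "openlm-research/open_llama_3b"   # assumed 3B reference checkpoint
N_INHERIT = 13                                # assumed number of inherited blocks

# Load the large reference model.
ref = AutoModelForCausalLM.from_pretrained(REF_MODEL)

# Target config: same width and vocabulary, but only the first N_INHERIT layers.
small_cfg = copy.deepcopy(ref.config)
small_cfg.num_hidden_layers = N_INHERIT
small = AutoModelForCausalLM.from_config(small_cfg)

# Inherit token embeddings, the first N_INHERIT transformer blocks,
# the final norm, and the LM head from the reference model
# (attribute paths assume a Llama-style model).
small.model.embed_tokens.load_state_dict(ref.model.embed_tokens.state_dict())
for i in range(N_INHERIT):
    small.model.layers[i].load_state_dict(ref.model.layers[i].state_dict())
small.model.norm.load_state_dict(ref.model.norm.state_dict())
small.lm_head.load_state_dict(ref.lm_head.state_dict())

small.save_pretrained("inheritune-init")
# Continue standard next-token pre-training of `small` on ~1B tokens (about 0.1%
# of the reference model's pre-training data) to obtain the final small base LM.
```

In this sketch the final norm and LM head are copied along with the embeddings; whether to inherit or re-initialize those pieces is a design choice, and the exact split used in the paper may differ.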

Results: Inheritune with 1B Data

Using just 1B tokens for pre-training, Inheritune produces a small base LM that performs competitively across diverse evaluation datasets. Notably, this model reaches roughly 89% of the downstream accuracy of its reference model on various tasks, despite the reference being twice its size and trained on 1000 times more data. These findings underscore Inheritune's computational efficiency and its potential for developing performant base models under stringent data and compute constraints.

Scaling Across Different Model Sizes

Inheritune's scalability is tested through the development of various small base LMs, derived from the same large base model but varying in size. Results indicate a positive relationship between the number of inherited transformer layers and model performance on the MMLU benchmark, highlighting Inheritune's adaptability to craft smaller LMs of varying capacities while maintaining competitive performance.
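
Under the same assumptions as the sketch above (a Llama-style reference checkpoint on Hugging Face; the layer counts below are illustrative, not the paper's exact grid), one could derive a family of sub-models of increasing depth and inspect their sizes before re-training and evaluating each, e.g. on MMLU:

```python
import copy
from transformers import AutoModelForCausalLM

REF_MODEL = "openlm-research/open_llama_3b"   # assumed reference, as in the sketch above
ref = AutoModelForCausalLM.from_pretrained(REF_MODEL)

# Derive progressively deeper target models from the same reference.
for n_layers in (6, 10, 13, 16, 20):          # illustrative depths
    cfg = copy.deepcopy(ref.config)
    cfg.num_hidden_layers = n_layers
    sub = AutoModelForCausalLM.from_config(cfg)
    n_params = sum(p.numel() for p in sub.parameters())
    print(f"{n_layers} inherited layers -> {n_params / 1e9:.2f}B params")
    # ...copy the first n_layers blocks (plus embeddings, norm, LM head) from `ref`
    # as in the previous sketch, then continue pre-training and evaluate on MMLU.
```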

Additional Analysis with Larger Reference LMs and 50B Data

Extending the analysis to scenarios with more available data (50B tokens) and larger reference models (up to 7B parameters), the findings show further gains for the smaller LMs. This extension demonstrates Inheritune's applicability across a broader range of scenarios: performance improves as more data becomes available and as larger reference models are leveraged.

Exploratory Analysis in the Presence of Full Pre-Training Data

In scenarios where the complete pre-training dataset is available, smaller models derived with Inheritune can match the validation loss of their larger reference counterparts trained from scratch for the same number of training steps. This section reaffirms the utility of Inheritune for reducing model size without sacrificing validation loss, offering a pragmatic option when computational resources are limited but the full pre-training data is accessible.
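
For the full-data setting, a sketch of the analogous construction for GPT-2 models is given below, assuming Hugging Face's `GPT2LMHeadModel`. The publicly released gpt2-large checkpoint stands in here for the reference model (the paper trains its own references on OpenWebText), and keeping half the blocks is an illustrative choice; training then proceeds on OpenWebText for the same number of steps as the from-scratch baseline.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Reference: GPT-2 large (36 blocks, ~770M params); keep half the blocks (assumption).
ref = GPT2LMHeadModel.from_pretrained("gpt2-large")
cfg = GPT2Config.from_pretrained("gpt2-large")
cfg.n_layer = 18

small = GPT2LMHeadModel(cfg)

# Inherit token/position embeddings, the first cfg.n_layer blocks, and the final norm.
small.transformer.wte.load_state_dict(ref.transformer.wte.state_dict())
small.transformer.wpe.load_state_dict(ref.transformer.wpe.state_dict())
for i in range(cfg.n_layer):
    small.transformer.h[i].load_state_dict(ref.transformer.h[i].state_dict())
small.transformer.ln_f.load_state_dict(ref.transformer.ln_f.state_dict())
# GPT-2 ties the LM head to the token embeddings, so copying wte also covers the head.

small.save_pretrained("gpt2-large-inherited-18L")
# Train on OpenWebText (~9B tokens) for the same number of steps as the from-scratch
# baseline, then compare validation loss against the full-depth reference.
```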

Implications

The Inheritune methodology offers an economical and computationally efficient pathway for developing small base LMs, challenging conventional approaches that rely heavily on large datasets and extensive computational resources. It provides a strong baseline for future pre-training efforts aimed at smaller model variants and sheds light on the notion of "sufficient depth," informing more deliberate architectural decisions in LLM development.

Conclusion

Inheritune introduces a remarkably efficient approach for developing small base LMs through strategic inheritance of transformer blocks and smart utilization of limited data resources. Its success across various settings and model sizes emphasizes the potential to democratize access to performant LMs, paving the way for broader experimentation and innovation within the field of AI and natural language processing.
