
Abstract

LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at $\href{https://llm-stacking.github.io/}{https://llm-stacking.github.io/}$.

Figure: Growth operators evaluated using training loss and the Lambada dataset.

Overview

  • The paper provides a detailed analysis of the depthwise stacking growth operator ($G_{\text{stack}}$) and its effectiveness in pre-training LLMs.

  • It systematically evaluates various model growth techniques and validates the scalability and efficiency of $G_{\text{stack}}$ through extensive experiments.

  • Practical guidelines for optimal model growth timing and factor are established, with implications for improving LLM pre-training efficiency.

Efficient Pre-training of LLMs Using Depthwise Stacking Operator

This essay provides an in-depth analysis of a recent study focusing on improving the efficiency of pre-training LLMs using a depthwise stacking growth operator, denoted as $G_{\text{stack}}$. The paper addresses three primary obstacles in the domain of model growth methods: the lack of comprehensive evaluation, untested scalability, and the absence of empirical guidelines.

Summary of Contributions

The paper systematically evaluates various model growth techniques and identifies that the depthwise stacking operator $G_{\text{stack}}$ significantly accelerates pre-training compared to training models from scratch. Key findings of the study include:

  1. Comprehensive evaluation of model growth operators.
  2. Validation of the $G_{\text{stack}}$ operator's scalability.
  3. Establishment of guidelines for the practical application of $G_{\text{stack}}$ in LLM pre-training.

Evaluation of Model Growth Techniques

The authors categorize existing model growth techniques into four atomic growth operators: direct duplication ($G_{\text{direct}}$), learnable parameter expansion ($G_{\text{learn}}$), zero initialization ($G_{\text{zero}}$), and random initialization ($G_{\text{random}}$). Each operator is instantiated and evaluated in both the depthwise and widthwise dimensions. The evaluation reveals that the depthwise variant of direct duplication, referred to as $G_{\text{stack}}$, consistently outperforms the other operators across multiple benchmarks.
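
To make the depthwise duplication concrete, below is a minimal sketch, not the authors' implementation: it assumes the model can be treated as a plain list of Transformer blocks (embeddings, norms, and the output head are omitted) and illustrates whole-stack duplication in PyTorch.

```python
# Minimal sketch of the depthwise stacking operator G_stack.
# Assumption: the model is represented as a bare list of Transformer blocks;
# a real implementation would also carry over embeddings, the final norm,
# and the output head from the small model.
import copy
import torch.nn as nn

def g_stack(layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Grow a trained L-layer stack into a (growth_factor * L)-layer stack
    by repeatedly appending copies of the whole trained stack."""
    grown = []
    for _ in range(growth_factor):
        for layer in layers:
            grown.append(copy.deepcopy(layer))  # duplicate trained weights
    return nn.ModuleList(grown)

# Toy example: grow a 6-layer stack into a 24-layer stack (g = 4); the grown
# model is then pre-trained further as usual.
small_stack = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(6)]
)
large_stack = g_stack(small_stack, growth_factor=4)
assert len(large_stack) == 24
```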

Scalability and Efficiency

The authors conduct extensive experiments to test the scalability of the $G_{\text{stack}}$ operator:

  1. Model Size Scaling: Experiments with model sizes up to 7B parameters and training data up to 300B tokens show that $G_{\text{stack}}$ maintains its efficiency, achieving a 54.5% speedup in pre-training for the 3B model; for the 7B model, $G_{\text{stack}}$ matches the loss of a conventionally trained 300B-token baseline using only 194B tokens, a 54.6% speedup (see the worked arithmetic after this list).
  2. Training Token Scaling: Pre-training a 410M LLM with 750B tokens demonstrates that $G_{\text{stack}}$ achieves continuous acceleration, indicating its potential for long-duration training tasks.
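
The speedup figures above come from comparing token budgets at equal training loss. A minimal illustration of that arithmetic, assuming this definition and plugging in the 7B numbers from the abstract:

```python
# Hedged illustration of how a token-budget speedup such as "54.6%" is obtained.
# Assumed definition: the extra tokens the from-scratch baseline needs, relative
# to the tokens the grown model needs, to reach the same training loss.
def speedup_at_equal_loss(baseline_tokens_b: float, grown_tokens_b: float) -> float:
    return (baseline_tokens_b - grown_tokens_b) / grown_tokens_b

# 7B example from the abstract: baseline uses 300B tokens, G_stack uses 194B.
print(f"{speedup_at_equal_loss(300, 194):.1%}")  # -> 54.6%
```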

Practical Guidelines

The paper addresses the lack of empirical guidelines for model growth by estimating the optimal growth timing ($d$) and growth factor ($g$) for $G_{\text{stack}}$; a sketch of applying these guidelines in code follows the list:

  1. Growth Timing ($d$): The authors fit a logarithmic function to determine the optimal $d$ (measured in training tokens) from the model size and computational budget, generally finding that growing after roughly 10B to 20B tokens optimizes efficiency.
  2. Growth Factor ($g$): Experiments suggest an optimal growth factor between 2 and 4, with a constant factor of 4 recommended for practical applications.
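
As a rough illustration only, the sketch below shows how such guidelines could be wired into a pre-training setup. The logarithmic functional form and the constants `C0` and `SLOPE` are placeholders chosen so the example lands in the 10B-20B-token range quoted above; they are not the paper's fitted values.

```python
# Hypothetical application of the growth guidelines. C0 and SLOPE are
# placeholder constants, NOT the paper's fitted coefficients; only the
# qualitative shape (growth timing d grows logarithmically with the compute
# budget, growth factor fixed at 4) follows the summary above.
import math

GROWTH_FACTOR = 4    # constant growth factor g = 4, as recommended above
C0 = 1e19            # placeholder reference compute budget (FLOPs)
SLOPE = 7.0          # placeholder slope of the logarithmic fit

def growth_timing_tokens_b(compute_budget_flops: float) -> float:
    """Growth timing d, in billions of tokens, as a logarithmic function of budget."""
    return SLOPE * math.log10(compute_budget_flops / C0)

# Example: under these placeholder constants, a ~1e21-FLOP budget maps to
# growing the small model 4x deeper after about 14B training tokens.
d = growth_timing_tokens_b(1e21)
print(f"grow after ~{d:.0f}B tokens with growth factor {GROWTH_FACTOR}")
```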

Implications and Future Research

The findings have significant implications for both the theoretical and practical aspects of LLM pre-training. The demonstrated scalability of the $G_{\text{stack}}$ operator suggests that this method can be effectively applied to very large models and extensive training datasets, which is critical as model sizes continue to grow.

Future research could focus on:

  1. Further Exploration of Growth Strategies: Investigate more sophisticated growth strategies beyond depthwise stacking to identify methods that could offer even greater efficiency.
  2. Longitudinal Studies: Conduct longer-term experiments with a wider range of model sizes and training data to solidify the practical guidelines and generalize findings.
  3. Function Preservation and Noise Introduction: Explore the role of function preservation in model growth, as initial findings indicate that controlled introduction of noise can sometimes improve performance.

Conclusion

This study presents a thorough and systematic evaluation of model growth techniques, with a particular focus on the depthwise stacking operator $G_{\text{stack}}$. By addressing key obstacles in the efficient pre-training of LLMs, the authors provide valuable insights and practical guidelines that can significantly enhance the pre-training process, offering a noteworthy contribution to the field of generative AI and LLM research.
