Abstract

Scaling transformers has led to significant breakthroughs in many domains, giving rise to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch, while also consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.

Figure: LiGO maps the weights of a smaller pretrained model into an initialization for a larger model, speeding up its training.

Overview

  • Introduces LiGO, a novel approach for efficiently training larger transformer models by transferring parameters from pretrained smaller models through a learned linear mapping.

  • LiGO factorizes the growth mapping into sparse width- and depth-expansion operators and applies a Kronecker factorization to them, making the mapping tractable to learn and effective for initializing larger models.

  • Demonstrates computational efficiency gains in training transformer models like BERT, RoBERTa, GPT-2, and vision transformers, saving up to 50% of computational costs.

  • Suggests future research directions, including the application to extremely large models, integration with other training efficiencies, and dynamic scaling processes.

Efficient Transformer Training via Learned Linear Growth Operators

Introduction

In recent years, the scaling of transformer models has been a significant driver of progress in deep learning. However, this scaling comes at a steep computational cost, particularly because new, larger models are frequently trained from scratch even though they are often just scaled-up versions of smaller, existing models. This paper introduces a novel approach that leverages previously trained smaller models to initialize larger models and thus accelerate their training. The proposed method, termed Learned Linear Growth Operator (LiGO), employs a data-driven mechanism to learn a linear mapping that transforms the parameters of a smaller pretrained model into an effective initialization for a larger model.

Methodology

Linear Growth Operator

LiGO operationalizes the idea that a larger model's parameters can be effectively initialized through a linear transformation of a smaller, pretrained model's parameters. Because directly learning a mapping between the full parameter spaces of the small and large models is impractical, the paper structures the linear map as a composition of sparse width- and depth-expansion operators. These operators are further factorized using Kronecker products, which groups parameters across layers and neurons, greatly reduces the number of quantities to learn, and embeds architectural knowledge in the transformation.
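To make the factorized operator concrete, the sketch below shows one way such width- and depth-growth maps could look in PyTorch. It is an illustrative reading of the idea rather than the authors' implementation: the class names `WidthGrowth` and `DepthGrowth`, the identity-based initialization, and the cyclic layer-copying start are assumptions made for the example.

```python
import torch
import torch.nn as nn


class WidthGrowth(nn.Module):
    """Maps a small weight matrix W (d_out x d_in) to a larger one
    (D_out x D_in) via W_large = A @ W @ B^T. This Kronecker-style
    factorization stands in for a full linear map over all entries
    of W, which would be far too large to learn directly."""

    def __init__(self, d_in, D_in, d_out, D_out):
        super().__init__()
        # Identity-style init: copy the small weights into the top-left
        # block and zero-pad the newly added rows/columns.
        self.A = nn.Parameter(torch.eye(D_out, d_out))  # expands output dim
        self.B = nn.Parameter(torch.eye(D_in, d_in))    # expands input dim

    def forward(self, W_small):
        return self.A @ W_small @ self.B.t()


class DepthGrowth(nn.Module):
    """Builds each of the L_large new layers as a learned linear
    combination of the L_small pretrained layers."""

    def __init__(self, L_small, L_large):
        super().__init__()
        self.mix = nn.Parameter(torch.zeros(L_large, L_small))
        with torch.no_grad():
            # Start from cyclic layer copying and let training refine
            # the combination weights.
            for l in range(L_large):
                self.mix[l, l % L_small] = 1.0

    def forward(self, stacked_small):
        # stacked_small: (L_small, d_out, d_in) -> (L_large, d_out, d_in)
        return torch.einsum('lk,kij->lij', self.mix, stacked_small)
```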

Application to Transformers

The paper details how embedding layers, attention mechanisms, and feedforward networks within a transformer are transformed by LiGO. To keep the grown model consistent across components, parameter-tying strategies are employed: the expansion applied to the embedding dimension is shared with the projections that consume it, and the width operators are structured to respect the multi-headed nature of attention layers. The same treatment carries through the rest of the model, so that width growth in the token embeddings is matched by corresponding transformations in every subsequent layer.
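As an illustration of this parameter tying, the hypothetical usage below continues the sketch above and reuses the same width-growth factors for a token-embedding matrix and a query projection, so their hidden dimensions grow consistently. The sizes and variable names are made up for the example, and the paper's actual scheme additionally ties factors across attention heads and layers.

```python
import torch

# Hypothetical sizes: grow the hidden dimension from 512 to 768.
d_small, d_large, vocab = 512, 768, 30522

# One shared width-growth module expands the hidden dimension everywhere
# it appears, keeping the grown model dimensionally consistent.
grow = WidthGrowth(d_in=d_small, D_in=d_large, d_out=d_small, D_out=d_large)

W_q_small = torch.randn(d_small, d_small)  # pretrained query projection
W_emb_small = torch.randn(vocab, d_small)  # pretrained token embeddings

W_q_large = grow(W_q_small)             # (768, 768): both dims expanded
W_emb_large = W_emb_small @ grow.B.t()  # (30522, 768): only hidden dim grows

print(W_q_large.shape, W_emb_large.shape)
```

Reusing `grow.B` for the embeddings mirrors the tying described above: the hidden dimension produced by the embedding layer must match the input dimension expected by the projections that consume it.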

Experimentation and Results

Extensive experiments were conducted on a range of models including BERT, RoBERTa, GPT-2, and vision transformers such as DeiT and CaiT. Across the board, LiGO demonstrated notable efficiency gains, saving up to 50% of the computational cost of training from scratch while matching or surpassing baseline performance on downstream tasks. These savings were observed not only in theoretical compute (FLOPs) but also in real-world metrics such as GPU wall-clock time.

Comparison with Existing Methods

LiGO was compared against several existing methods that improve training efficiency through model growth, including StackBERT, MSLT, and bert2BERT. Notably, LiGO outperformed these methods, underscoring its effectiveness as a model growth and initialization strategy. Furthermore, the approach proved robust across domains (language and vision tasks), model architectures, and optimization settings.

Implications and Future Directions

The results presented in this paper illustrate the potential of leveraging pretrained models for more efficient training of larger models. By adopting a structured and learned approach to parameter initialization, LiGO addresses the computational redundancy inherent in the current practice of training scaled-up models from scratch. This research has practical implications for ongoing efforts to scale transformer models and represents a step forward in the pursuit of more computationally efficient deep learning methodologies.

Looking ahead, several avenues for further research emerge. An immediate question is the applicability of LiGO to the very largest models currently in use, such as those in the GPT-3 family. Additionally, integrating LiGO with other efficient training strategies, such as layer and token dropping or staged training, could yield further efficiencies. Finally, the potential of LiGO to facilitate more dynamic scaling processes, where models are continuously grown and adapted to new tasks or data, offers an exciting future direction for exploration.

In summary, the Learned Linear Growth Operator (LiGO) presents a significant advance in the efficient training of scaled-up transformer models. By enabling the direct transfer of learned parameters from smaller to larger models, LiGO offers a promising route to mitigating the computational costs associated with ongoing model scaling efforts.
