MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

(2312.17482)
Published Dec 29, 2023 in cs.CL and cs.LG

Abstract

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open-source our model weights and code.

Overview

  • MosaicBERT is a BERT-style encoder architecture and training recipe optimized for fast pretraining without sacrificing downstream accuracy.

  • Key architectural enhancements include FlashAttention, ALiBi, Gated Linear Units, low-precision LayerNorm, and a module that dynamically removes padded tokens.

  • MosaicBERT cuts pretraining cost sharply: the base model reaches an average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs, roughly $20 of compute.

  • Accuracy vs. pretraining speed Pareto curves show that MosaicBERT base and large are consistently Pareto optimal against competitive BERT baselines.

  • The paper emphasizes MosaicBERT's utility for domain-specific pretraining and releases the model weights and code for further research.

Introduction

BERT-style encoder models remain essential tools in NLP, yet few practitioners pretrain them from scratch because of the cost. MosaicBERT targets exactly this gap: a BERT-style encoder architecture and training recipe empirically optimized for pretraining speed without giving up accuracy. It does so by folding modern transformer components and efficient training techniques into the classic encoder block, yielding a fast and cost-effective option for researchers and engineers alike.

Architectural Enhancements

At the heart of MosaicBERT are several architectural changes designed to accelerate pretraining. FlashAttention reduces the memory reads and writes in the attention computation, speeding up each training step, while Attention with Linear Biases (ALiBi) encodes positional information by adding distance-proportional penalties to attention scores instead of using learned position embeddings. In addition, Gated Linear Units (GLUs) in the feedforward layers, low-precision LayerNorm, and a module that dynamically removes padded tokens so no compute is spent on them all contribute to pretraining efficiency. A sketch of how ALiBi biases and a GLU feedforward layer can fit into an encoder block is shown below.
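The following is a minimal, illustrative sketch (not the authors' implementation) of an encoder block combining symmetric ALiBi biases, a fused scaled-dot-product attention call, and a GELU-gated GLU feedforward layer. Names such as `EncoderBlock` and `alibi_bias`, the pre-LayerNorm ordering, and the hidden-layer sizing are assumptions for illustration; PyTorch's `scaled_dot_product_attention` may or may not dispatch to a FlashAttention kernel depending on the inputs.

```python
# Illustrative sketch only: one encoder block with ALiBi biases, fused
# attention, and a GLU feedforward layer. Hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric ALiBi bias for a bidirectional encoder: -slope * |i - j|."""
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()         # (seq, seq)
    return -slopes[:, None, None] * dist[None, :, :]    # (heads, seq, seq)


class GLUFeedForward(nn.Module):
    """Feedforward with a GELU-gated linear unit in place of the usual MLP."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.value = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.gelu(self.gate(x)) * self.value(x))


class EncoderBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ffn = GLUFeedForward(d_model, 4 * d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        # ALiBi bias recomputed per call for simplicity; broadcast over batch.
        bias = alibi_bias(self.n_heads, s).to(x.dtype).to(x.device)
        # Fused attention; PyTorch selects a FlashAttention or memory-efficient
        # kernel when the inputs allow it, otherwise falls back to plain math.
        attn = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
        attn = attn.transpose(1, 2).reshape(b, s, d)
        x = x + self.proj(attn)
        return x + self.ffn(self.norm2(x))
```

The low-precision LayerNorm and bfloat16 training mentioned in the abstract are orthogonal to this block structure and are left to the caller here.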

Pretraining Acceleration

One notable result is MosaicBERT's downstream performance on the GLUE (General Language Understanding Evaluation) benchmark with modest resources: MosaicBERT-Base reached an average GLUE (dev) score of 79.6 in only 1.13 hours on eight A100 80 GB GPUs, roughly 9 GPU-hours of compute at a cost of approximately $20. The training recipe also raises the MLM masking ratio to 30%, trains in bfloat16, and uses a vocabulary size chosen for GPU throughput (a simplified sketch of the masking step follows below). This reduction in training time and expense opens the door to custom BERT-style models tailored to specific domains, without the costs usually associated with pretraining from scratch.
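Here is a hedged sketch of the 30% masking step for the MLM objective mentioned in the abstract. The function name, the `mask_token_id` and `special_mask` arguments, and the omission of BERT's usual 80/10/10 corruption split are simplifications for illustration, not a description of the released training code.

```python
# Illustrative 30% MLM masking: replace ~30% of non-special tokens with [MASK]
# and set labels to -100 elsewhere so the loss ignores unmasked positions.
import torch


def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                special_mask: torch.Tensor, mlm_prob: float = 0.30):
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, mlm_prob)
    probs.masked_fill_(special_mask, 0.0)       # never mask [CLS]/[SEP]/padding
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                      # ignored by cross-entropy loss
    corrupted = input_ids.masked_fill(masked, mask_token_id)
    return corrupted, labels
```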

Optimal Performance

MosaicBERT's architecture and pretraining strategies are empirically shown to be efficient. The model reaches high scores on language understanding benchmarks quickly, and the trade-off between accuracy and training time is characterized through Pareto curves. Compared systematically against standard BERT base and large baselines, both MosaicBERT Base and Large sit on the Pareto frontier: no compared configuration reaches higher accuracy in less pretraining time. A minimal sketch of how such a frontier can be extracted from (time, accuracy) measurements follows below.
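To make the notion concrete, below is a minimal sketch of extracting a Pareto frontier over (pretraining hours, GLUE score) points. The function name is made up and the run list is hypothetical, not the paper's measured results.

```python
# Illustrative Pareto-frontier extraction over (hours, score) pairs.
def pareto_frontier(points):
    """Keep runs not dominated by another run that is at least as fast and at least as accurate."""
    frontier = []
    for hours, score in points:
        dominated = any(h <= hours and s >= score and (h, s) != (hours, score)
                        for h, s in points)
        if not dominated:
            frontier.append((hours, score))
    return sorted(frontier)


runs = [(1.13, 79.6), (2.0, 80.5), (3.0, 80.4), (0.5, 76.0)]  # hypothetical
print(pareto_frontier(runs))  # -> [(0.5, 76.0), (1.13, 79.6), (2.0, 80.5)]
```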

Conclusion and Contributions

MosaicBERT combines established and newer architectural components with an optimized training recipe to deliver an efficient and effective encoder. It makes pretraining custom BERT-style models fast and economical, shifting the practical default away from finetuning generic models and toward domain-specific pretraining. The authors have released their model weights and code, which supports replication and further work within the NLP community.
