MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

(2312.17482)
Published Dec 29, 2023 in cs.CL and cs.LG

Abstract

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the past half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open-source our model weights and code.

Overview

  • MosaicBERT is a BERT-style encoder architecture and training recipe optimized for fast pretraining without sacrificing downstream accuracy.

  • Key architectural enhancements include FlashAttention, ALiBi, Gated Linear Units, low-precision LayerNorm, and a module that dynamically removes padded tokens.

  • MosaicBERT cuts pretraining cost sharply: the base model reaches an average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs, roughly $20 of compute.

  • Accuracy vs. pretraining speed Pareto curves show that MosaicBERT base and large are consistently Pareto optimal against competitive BERT baselines.

  • The paper emphasizes MosaicBERT's utility for domain-specific pretraining and releases the model weights and code for further research.

Introduction

BERT-style encoder models remain essential tools in NLP, yet few practitioners pretrain them from scratch because of the cost. MosaicBERT targets exactly this gap: a BERT-style encoder architecture and training recipe empirically optimized for pretraining speed without giving up accuracy. It does so by folding modern transformer components and efficient training techniques into the classic encoder block, yielding a fast and cost-effective option for researchers and engineers alike.

Architectural Enhancements

At the heart of MosaicBERT are several architectural changes designed to accelerate pretraining. FlashAttention reduces the memory reads and writes in the attention computation, speeding up each training step, while Attention with Linear Biases (ALiBi) encodes positional information by adding distance-proportional penalties to attention scores instead of using learned position embeddings. In addition, Gated Linear Units (GLUs) in the feedforward layers, low-precision LayerNorm, and a module that dynamically removes padded tokens so no compute is spent on them all contribute to pretraining efficiency. A sketch of how ALiBi biases and a GLU feedforward layer can fit into an encoder block is shown below.
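The following is a minimal, illustrative sketch (not the authors' implementation) of an encoder block combining symmetric ALiBi biases, a fused scaled-dot-product attention call, and a GELU-gated GLU feedforward layer. Names such as `EncoderBlock` and `alibi_bias`, the pre-LayerNorm ordering, and the hidden-layer sizing are assumptions for illustration; PyTorch's `scaled_dot_product_attention` may or may not dispatch to a FlashAttention kernel depending on the inputs.

```python
# Illustrative sketch only: one encoder block with ALiBi biases, fused
# attention, and a GLU feedforward layer. Hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric ALiBi bias for a bidirectional encoder: -slope * |i - j|."""
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()         # (seq, seq)
    return -slopes[:, None, None] * dist[None, :, :]    # (heads, seq, seq)


class GLUFeedForward(nn.Module):
    """Feedforward with a GELU-gated linear unit in place of the usual MLP."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.value = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.gelu(self.gate(x)) * self.value(x))


class EncoderBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ffn = GLUFeedForward(d_model, 4 * d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        # ALiBi bias recomputed per call for simplicity; broadcast over batch.
        bias = alibi_bias(self.n_heads, s).to(x.dtype).to(x.device)
        # Fused attention; PyTorch selects a FlashAttention or memory-efficient
        # kernel when the inputs allow it, otherwise falls back to plain math.
        attn = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
        attn = attn.transpose(1, 2).reshape(b, s, d)
        x = x + self.proj(attn)
        return x + self.ffn(self.norm2(x))
```

The low-precision LayerNorm and bfloat16 training mentioned in the abstract are orthogonal to this block structure and are left to the caller here.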

Pretraining Acceleration

One notable result is MosaicBERT's downstream performance on the GLUE (General Language Understanding Evaluation) benchmark with modest resources: MosaicBERT-Base reached an average GLUE (dev) score of 79.6 in only 1.13 hours on eight A100 80 GB GPUs, roughly 9 GPU-hours of compute at a cost of approximately $20. The training recipe also raises the MLM masking ratio to 30%, trains in bfloat16, and uses a vocabulary size chosen for GPU throughput (a simplified sketch of the masking step follows below). This reduction in training time and expense opens the door to custom BERT-style models tailored to specific domains, without the costs usually associated with pretraining from scratch.
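Here is a hedged sketch of the 30% masking step for the MLM objective mentioned in the abstract. The function name, the `mask_token_id` and `special_mask` arguments, and the omission of BERT's usual 80/10/10 corruption split are simplifications for illustration, not a description of the released training code.

```python
# Illustrative 30% MLM masking: replace ~30% of non-special tokens with [MASK]
# and set labels to -100 elsewhere so the loss ignores unmasked positions.
import torch


def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                special_mask: torch.Tensor, mlm_prob: float = 0.30):
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, mlm_prob)
    probs.masked_fill_(special_mask, 0.0)       # never mask [CLS]/[SEP]/padding
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                      # ignored by cross-entropy loss
    corrupted = input_ids.masked_fill(masked, mask_token_id)
    return corrupted, labels
```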

Optimal Performance

MosaicBERT's architecture and pretraining strategies are empirically shown to be efficient. The model reaches high scores on language understanding benchmarks quickly, and the trade-off between accuracy and training time is characterized through Pareto curves. Compared systematically against standard BERT base and large baselines, both MosaicBERT Base and Large sit on the Pareto frontier: no compared configuration reaches higher accuracy in less pretraining time. A minimal sketch of how such a frontier can be extracted from (time, accuracy) measurements follows below.
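To make the notion concrete, below is a minimal sketch of extracting a Pareto frontier over (pretraining hours, GLUE score) points. The function name is made up and the run list is hypothetical, not the paper's measured results.

```python
# Illustrative Pareto-frontier extraction over (hours, score) pairs.
def pareto_frontier(points):
    """Keep runs not dominated by another run that is at least as fast and at least as accurate."""
    frontier = []
    for hours, score in points:
        dominated = any(h <= hours and s >= score and (h, s) != (hours, score)
                        for h, s in points)
        if not dominated:
            frontier.append((hours, score))
    return sorted(frontier)


runs = [(1.13, 79.6), (2.0, 80.5), (3.0, 80.4), (0.5, 76.0)]  # hypothetical
print(pareto_frontier(runs))  # -> [(0.5, 76.0), (1.13, 79.6), (2.0, 80.5)]
```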

Conclusion and Contributions

MosaicBERT combines established and newer architectural components with an optimized training recipe to deliver an efficient and effective encoder. It makes pretraining custom BERT-style models fast and economical, shifting the practical default away from finetuning generic models and toward domain-specific pretraining. The authors have released their model weights and code, which supports replication and further work within the NLP community.
