Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to even larger and more capable language models. But advancing the state of the art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning across a diverse set of tasks, including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed-book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
This paper introduces the ST-MoE model, designed to address stability and performance issues in Mixture-of-Experts (MoE) models by implementing various design and training improvements.
New stability techniques and insights into fine-tuning dynamics are presented, including the introduction of the router z-loss and a recommendation for top-2 routing, which improve model efficiency without compromising quality.
A large-scale analysis reveals the importance of careful hyperparameter selection and architectural decisions in creating efficient and stable sparse models.
The ST-MoE-32B model outperforms existing models on various NLP tasks, showcasing the potential of efficiently scaled sparse models.
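The router z-loss mentioned above is defined in the ST-MoE paper as the mean, over tokens, of the squared log-sum-exp of the router logits, which discourages large logits entering the router softmax and improves numerical stability. A minimal NumPy sketch (function name is illustrative, not from the paper's codebase):

```python
import numpy as np

def router_z_loss(logits):
    """Router z-loss: mean over tokens of (log sum_j exp(logit_j))^2.

    Penalizing the squared log-sum-exp keeps router logits small,
    which stabilizes training without changing routing decisions.
    logits: array of shape [num_tokens, num_experts].
    """
    # log-sum-exp over the expert dimension for each token
    log_z = np.log(np.sum(np.exp(logits), axis=-1))
    # squared penalty, averaged over the batch of tokens
    return np.mean(log_z ** 2)
```

In practice this auxiliary loss is added to the task loss with a small coefficient so it regularizes the router without dominating training.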
The Mixture-of-Experts (MoE) model represents a class of deep neural networks designed to significantly scale the model's capacity while maintaining manageable computation costs. Unlike uniform architectures that scale monolithically, MoE models achieve efficiency and capacity by selectively activating only a subset of parameters (referred to as experts) for each input. This design allows MoE models to increase the number of parameters without proportionally increasing the computational overhead.
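The selective activation described above can be sketched as a top-2 routing layer: a learned router scores each token against every expert, and only the two highest-scoring experts process that token, with their outputs combined by the renormalized router probabilities. The sketch below (NumPy, hypothetical names) omits production details such as capacity factors and load-balancing losses:

```python
import numpy as np

def moe_top2_forward(x, w_router, experts):
    """Toy top-2 MoE layer.

    x:        [num_tokens, d_model] token representations
    w_router: [d_model, num_experts] router weights
    experts:  list of callables, each mapping a d_model vector to d_model
    """
    logits = x @ w_router                          # [tokens, experts]
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # softmax over experts
    top2 = np.argsort(probs, axis=-1)[:, -2:]      # top-2 expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, top2[t]]
        gates = gates / gates.sum()                # renormalize over top-2
        # Only 2 of the experts run per token: compute scales with the
        # active parameters, not the total parameter count.
        for g, e in zip(gates, top2[t]):
            out[t] += g * experts[e](x[t])
    return out
```

Because each token activates only two experts regardless of how many exist, adding experts grows parameter count without a proportional growth in per-token computation.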
While MoE models have demonstrated substantial improvements in natural language processing tasks, their adoption has faced hurdles. Two notable issues have been training instabilities during pre-training and uncertain quality gains when fine-tuning on downstream tasks.
This paper addresses the aforementioned challenges through several contributions to MoE model design and training, including the router z-loss for stability, an analysis of routing and fine-tuning dynamics, and a large-scale study of the hyperparameter and architectural choices behind efficient, stable sparse models.
Leveraging our design and training improvements, we scaled a sparse model to 269 billion parameters, achieving a computational cost comparable to a 32 billion parameter dense encoder-decoder Transformer (designated ST-MoE-32B). This model sets new state-of-the-art performance benchmarks across a diverse array of NLP tasks, including reasoning, summarization, closed-book question answering, and adversarially constructed tasks, outperforming existing sparse and dense models.
Our research presents a pragmatic approach to designing stable, efficient, and transferable MoE models. By focusing on both upstream pre-training and downstream fine-tuning metrics, we mitigate the quality gap observed in earlier sparse models. The success of our ST-MoE-32B model underscores the potential of sparse models to achieve superior performance efficiently across a wide range of NLP benchmarks.
Looking forward, the adaptive nature of MoE models offers avenues for even more dynamic and efficient architectures. Future work could explore more nuanced routing mechanisms, heterogeneous expert designs, and the integration of MoE models with other modalities. Additionally, refining the fine-tuning process to further bridge the gap between pre-training efficacy and fine-tuning performance remains a promising area of exploration.