
ST-MoE: Designing Stable and Transferable Sparse Expert Models

(2202.08906)
Published Feb 17, 2022 in cs.CL and cs.LG

Abstract

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).

Fine-tuning only a subset of the model parameters can improve sparse models' generalization, except when only the MoE parameters are updated, which degrades quality.

Overview

  • This paper introduces the ST-MoE model, designed to address stability and performance issues in Mixture-of-Experts (MoE) models by implementing various design and training improvements.

  • New stability techniques and insights into fine-tuning dynamics are presented, including the introduction of the router z-loss and a recommendation for top-2 routing, which improve model efficiency and stability without compromising quality.

  • A large-scale analysis reveals the importance of careful hyperparameter selection and architectural decisions in creating efficient and stable sparse models.

  • The ST-MoE-32B model outperforms existing models on various NLP tasks, showcasing the potential of efficiently scaled sparse models.

Designing and Scaling Sparse Mixture-of-Experts Models for Improved Efficiency and Performance

Introduction to Mixture-of-Experts (MoE)

The Mixture-of-Experts (MoE) model is a class of deep neural networks designed to scale model capacity significantly while keeping computation costs manageable. Unlike dense architectures that scale monolithically, MoE models achieve both efficiency and capacity by activating only a subset of parameters (referred to as experts) for each input, so the parameter count can grow without a proportional increase in computational overhead. The routing mechanism at the heart of this design is illustrated in the sketch below.
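To make the routing idea concrete, the following is a minimal NumPy sketch of a top-2 MoE layer with a fixed expert capacity, in the spirit of the design recommendations later in this summary. The function and variable names (top2_moe_layer, capacity_factor, and so on) are illustrative and are not taken from the paper's Mesh TensorFlow implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top2_moe_layer(tokens, w_router, experts, capacity_factor=1.25):
    """Sketch of a top-2 MoE layer (illustrative names, not the paper's API).

    tokens:   [num_tokens, d_model] activations entering the layer
    w_router: [d_model, num_experts] router weights
    experts:  list of callables, each mapping a [d_model] vector to [d_model]
    """
    num_tokens, _ = tokens.shape
    num_experts = len(experts)

    # Each expert processes at most `capacity` tokens per batch; tokens routed
    # to a full expert are dropped here (and carried by the residual
    # connection in a real Transformer block).
    capacity = int(capacity_factor * num_tokens / num_experts)

    probs = softmax(tokens @ w_router)          # [num_tokens, num_experts]
    top2 = np.argsort(-probs, axis=-1)[:, :2]   # two best experts per token

    output = np.zeros_like(tokens)
    load = np.zeros(num_experts, dtype=int)
    for t in range(num_tokens):
        for e in top2[t]:
            if load[e] >= capacity:             # expert is full: overflow
                continue
            load[e] += 1
            # Each expert's output is weighted by its router probability.
            output[t] += probs[t, e] * experts[e](tokens[t])
    return output
```

In practice the routing and expert computation are batched and sharded across accelerator cores; the per-token loop above only makes the top-2 selection, gating, and capacity logic explicit.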

Challenges with MoE Models

While MoE models have demonstrated substantial improvements in natural language processing tasks, their adoption has faced hurdles. Two notable issues have been:

  1. Training Instability: MoE models are prone to training instabilities, with divergences causing a fraction of training runs to fail.
  2. Quality Degradation in Fine-tuning: Despite strong results during pre-training, MoE models often underperform their dense counterparts when fine-tuned on specific tasks.

Our Contributions

This paper addresses the aforementioned challenges through several contributions to the MoE model design and training regime. The key contributions include:

  1. A large-scale study of the quality-stability trade-offs of various stabilization techniques, culminating in the introduction of the router z-loss as a mechanism that combats instability without sacrificing model quality (a sketch of this loss follows the list).
  2. A comprehensive analysis of fine-tuning dynamics in sparse models, showing that they are markedly more sensitive to batch size and learning rate than dense models and motivating corresponding adjustments to the fine-tuning protocol.
  3. Architectural and model design principles for creating Pareto-efficient sparse models. In particular, we recommend top-2 routing with a train capacity factor of 1.25 and at most one expert per core to balance compute and memory usage.
  4. Insights into token routing decisions across expert layers, revealing patterns of specialization among experts.
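To make the first contribution concrete, here is a minimal NumPy sketch of the router z-loss as defined in the paper: the squared log-sum-exp of the router logits, averaged over tokens. The function name router_z_loss is ours, and the snippet illustrates the definition rather than reproducing the paper's implementation.

```python
import numpy as np

def router_z_loss(router_logits):
    """Router z-loss: mean squared log-sum-exp of the router logits.

    router_logits: [num_tokens, num_experts] pre-softmax router scores.
    Penalizing large logits keeps the router softmax numerically
    well-behaved, which the paper finds stabilizes training without
    hurting quality.
    """
    # Numerically stable log-sum-exp over the expert dimension.
    m = router_logits.max(axis=-1, keepdims=True)
    log_z = np.squeeze(m, axis=-1) + np.log(np.exp(router_logits - m).sum(axis=-1))
    return np.mean(log_z ** 2)

# Illustrative usage: the z-loss is added to the training objective with a
# small coefficient (the paper reports 1e-3), alongside the usual
# load-balancing auxiliary loss.
# total_loss = task_loss + aux_coeff * load_balance_loss + 1e-3 * router_z_loss(logits)
```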

Performance of ST-MoE-32B Model

Leveraging our design and training improvements, we scaled a sparse model to 269 billion parameters with a computational cost comparable to that of a 32-billion-parameter dense encoder-decoder Transformer (designated ST-MoE-32B). This model sets new state-of-the-art results across a diverse array of NLP tasks, including reasoning, summarization, closed book question answering, and adversarially constructed tasks, outperforming existing sparse and dense models.

Implications and Future Directions

Our research presents a pragmatic approach to designing stable, efficient, and transferable MoE models. By attending to both upstream pre-training and downstream fine-tuning metrics, we mitigate the quality gap observed in earlier sparse models. The success of the ST-MoE-32B model underscores the potential of sparse models to deliver superior performance efficiently across a wide range of NLP benchmarks.

Looking forward, the adaptive nature of MoE models offers avenues for even more dynamic and efficient architectures. Future work could explore more nuanced routing mechanisms, heterogeneous expert designs, and the integration of MoE models with other modalities. Additionally, refining the fine-tuning process to further bridge the gap between pre-training efficacy and fine-tuning performance remains a promising area of exploration.
