ST-MoE: Designing Stable and Transferable Sparse Expert Models (2202.08906v2)

Published 17 Feb 2022 in cs.CL and cs.LG

Abstract: Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable LLMs. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).

Citations (133)

Summary

  • The paper introduces stability techniques, including router z-loss, to prevent training divergence in sparse MoE models.
  • It details a comprehensive fine-tuning analysis that highlights unique hyperparameter sensitivities in sparse architectures.
  • The study presents architectural guidelines, such as top-2 routing and optimal expert allocation, to balance compute and memory while scaling to 269B parameters.

Designing and Scaling Sparse Mixture of Experts Models for Improved Efficiency and Performance

Introduction to Mixture-of-Experts (MoE)

The Mixture-of-Experts (MoE) model represents a class of deep neural networks designed to significantly scale the model's capacity while maintaining manageable computation costs. Unlike uniform architectures that scale monolithically, MoE models achieve efficiency and capacity by selectively activating only a subset of parameters (referred to as experts) for each input. This design allows MoE models to increase the number of parameters without proportionally increasing the computational overhead.
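To make the idea of selective activation concrete, the sketch below shows a toy top-k MoE forward pass in NumPy. The router weights, expert functions, and shapes are hypothetical placeholders rather than the paper's implementation, and load balancing and capacity limits are omitted for brevity.

```python
import numpy as np

def moe_forward(tokens, router_weights, experts, k=2):
    """Toy top-k Mixture-of-Experts layer (illustrative only).

    tokens:         [num_tokens, d_model] input activations
    router_weights: [d_model, num_experts] routing projection
    experts:        list of callables, each mapping a [d_model] vector
                    to a [d_model] vector
    k:              number of experts activated per token
    """
    logits = tokens @ router_weights                      # [num_tokens, num_experts]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    gates = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    output = np.zeros_like(tokens)
    top_k = np.argsort(-gates, axis=-1)[:, :k]            # k chosen experts per token
    for t in range(tokens.shape[0]):
        for e in top_k[t]:
            # Only k experts run for this token; the rest are skipped,
            # which is how parameter count grows without proportional compute.
            output[t] += gates[t, e] * experts[e](tokens[t])
    return output
```

Because each token touches only k experts, adding experts increases total parameters while per-token FLOPs stay roughly fixed.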

Challenges with MoE Models

While MoE models have demonstrated substantial improvements in natural language processing tasks, their adoption has faced hurdles. Two notable issues have been:

  1. Training Instability: MoE models have been prone to training instabilities, with divergences causing a fraction of training runs to fail outright.
  2. Quality Degradation in Fine-tuning: Despite showing promise during pre-training, MoE models often underperform their dense counterparts when fine-tuned on specific tasks.

Our Contributions

This paper addresses the aforementioned challenges through several contributions to the MoE model design and training regime. The key contributions include:

  1. A large-scale study of the quality-stability trade-offs of various stability techniques, culminating in the introduction of the router z-loss as a mechanism to combat instability without sacrificing model quality (a minimal sketch follows this list).
  2. A comprehensive analysis of fine-tuning dynamics in sparse models, unveiling a notable sensitivity to batch size and learning rate, as opposed to dense models. Our findings suggest a distinct hyperparameter sensitivity pattern for sparse models, advocating for adjustments in the fine-tuning protocol.
  3. The introduction of architectural and model design principles for creating Pareto efficient sparse models. Particularly, we recommend using top-2 routing with a train capacity factor of 1.25 and maintaining at most one expert per core to balance compute and memory usage efficiently.
  4. Insights into token routing decisions across expert layers, revealing patterns of specialization among experts.
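As a rough illustration of the router z-loss from contribution 1, the sketch below implements the penalty as described in the paper: the squared log-sum-exp of the router logits, averaged over tokens. The standalone NumPy setting and tensor shapes are assumptions for readability; in practice the loss is computed inside the model and added to the training objective.

```python
import numpy as np

def router_z_loss(router_logits):
    """Router z-loss: mean squared log-sum-exp of the router logits.

    router_logits: [num_tokens, num_experts] pre-softmax routing scores.

    Penalizing large logits keeps the inputs to the routing softmax small,
    which the paper finds prevents training divergence without degrading
    model quality.
    """
    # Numerically stable log-sum-exp over the expert dimension.
    max_logits = router_logits.max(axis=-1, keepdims=True)
    log_sum_exp = np.squeeze(max_logits, axis=-1) + np.log(
        np.exp(router_logits - max_logits).sum(axis=-1)
    )
    return np.mean(log_sum_exp ** 2)
```

During training this term is scaled by a small coefficient (on the order of 1e-3 in the paper) and added alongside the cross-entropy and load-balancing losses.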

Performance of ST-MoE-32B Model

Leveraging our design and training improvements, we scaled a sparse model to 269 billion parameters, achieving a computational cost comparable to a 32 billion parameter dense encoder-decoder transformer (designated as ST-MoE-32B). This model sets new state-of-the-art performance benchmarks across a diverse array of NLP tasks, including reasoning, summarization, closed book question answering, and adversarially constructed tasks, outperforming existing sparse and dense models.

Implications and Future Directions

Our research presents a pragmatic approach to designing stable, efficient, and transferable MoE models. By attending to both upstream pre-training and downstream fine-tuning metrics, we mitigate the quality gap observed in earlier sparse models. The success of the ST-MoE-32B model underscores the potential of sparse models to deliver superior performance efficiently across a wide range of NLP benchmarks.

Looking forward, the adaptive nature of MoE models offers avenues for even more dynamic and efficient architectures. Future work could explore more nuanced routing mechanisms, heterogeneous expert designs, and the integration of MoE models with other modalities. Additionally, refining the fine-tuning process to further bridge the gap between pre-training efficacy and fine-tuning performance remains a promising area of exploration.
