DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Abstract

As the training of giant dense models hits the boundary of the availability and capability of today's hardware resources, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to quality-equivalent dense models. Their training cost savings have been demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to a shift in the large model landscape from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.

Overview

  • Introduces DeepSpeed-MoE as an efficient solution for Mixture-of-Experts model inference within the DeepSpeed library.

  • Presents a new MoE architecture, Pyramid-Residual MoE (PR-MoE), which improves parameter efficiency by combining residual expert connections with a pyramid-shaped allocation of experts across layers.

  • Proposes the Mixture-of-Students (MoS) technique, which distills a PR-MoE teacher into smaller student models to preserve quality while reducing size.

  • Achieves significant reductions in inference latency and cost, enabling practical use of trillion-parameter MoE models.

  • Sets the stage for scalable AI with reduced computational resources, providing research and code publicly for community involvement.

Overview of DeepSpeed-MoE

Mixture-of-Experts (MoE) models have emerged as a prominent architecture for meeting the growing demand for model quality without a proportional increase in training cost. They nevertheless pose substantial challenges, particularly at inference time, because of their massive parameter counts and unique architectural traits. The paper introduces DeepSpeed-MoE, a comprehensive solution within the DeepSpeed library that significantly improves the efficiency and scalability of MoE inference.

Innovating Model Architecture and Training

Research has demonstrated that MoE models can reduce training costs by 3 to 5 times relative to traditional dense models while maintaining comparable quality. The paper builds on this with Pyramid-Residual MoE (PR-MoE), a new MoE architecture that allocates more experts to the deeper layers (the pyramid) and pairs each token's single gate-selected expert with a fixed shared MLP (the residual connection). This design reduces the parameter count by up to 3x without compromising model quality. Furthermore, the study introduces Mixture-of-Students (MoS), in which a PR-MoE model serves as the teacher for smaller student MoE models, using knowledge distillation to achieve up to a 3.7x reduction in model size while preserving accuracy.
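To make the PR-MoE idea concrete, here is a minimal PyTorch-style sketch (an assumption-laden illustration, not DeepSpeed's implementation): every token passes through a shared dense MLP plus one expert chosen by a top-1 gate, and deeper layers are given more experts. Names such as ResidualMoELayer and shared_mlp are illustrative, and the sketch omits load balancing, expert capacity limits, expert parallelism, and the surrounding transformer block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMoELayer(nn.Module):
    """Sketch of a Residual-MoE layer: a shared dense MLP plus one
    gate-selected expert per token (top-1 routing)."""

    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()

        def make_mlp():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

        self.shared_mlp = make_mlp()                 # residual branch, applied to every token
        self.experts = nn.ModuleList(make_mlp() for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)  # top-1 router

    def forward(self, x):                            # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)
        top1 = probs.argmax(dim=-1)                  # chosen expert id per token
        expert_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):    # naive dispatch; no batching tricks
            mask = top1 == e
            if mask.any():
                expert_out[mask] = expert(x[mask])
        gate_weight = probs.gather(1, top1.unsqueeze(1))
        # Shared-MLP output plus the gate-weighted expert "correction";
        # the usual transformer residual/LayerNorm is omitted here.
        return self.shared_mlp(x) + gate_weight * expert_out

# "Pyramid": deeper layers get more experts (the counts below are illustrative).
moe_layers = [ResidualMoELayer(d_model=1024, d_ff=4096, num_experts=n)
              for n in (64, 64, 64, 128, 128)]
```

The MoS step would then distill such a PR-MoE teacher into a student with fewer layers, training the student on the teacher's soft outputs alongside the usual language-modeling loss.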

Reimagining MoE Inference

At inference time, MoE models face performance challenges stemming from their much larger memory footprint. DeepSpeed-MoE overcomes this with a highly optimized inference system that scales efficiently across GPUs, offering up to a 7.3x reduction in inference latency and cost compared to existing MoE inference solutions. The result is ultra-low latency even for trillion-parameter MoE models, making massive MoE models viable for real-world applications.
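A standard way such systems scale across GPUs is to shard the experts over devices, so each token must be routed to the GPU that holds its selected expert and then returned. The hedged sketch below illustrates only this token-exchange pattern with torch.distributed all-to-all collectives; it assumes an already-initialized process group (e.g. launched via torchrun, with a backend such as NCCL that supports all-to-all) and exactly one expert per rank. The function name expert_parallel_forward and all simplifications (no gate-weighted combination, no shared PR-MoE branch, no fused kernels) are assumptions of this sketch, not DeepSpeed's API.

```python
import torch
import torch.distributed as dist

def expert_parallel_forward(x, gate_logits, local_expert):
    """x: (num_tokens, d_model); gate_logits: (num_tokens, world_size).
    Toy setting: expert i lives on rank i, so a token's top-1 expert id
    is also its destination rank."""
    world_size = dist.get_world_size()

    dest = gate_logits.argmax(dim=-1)                 # destination rank per token
    order = torch.argsort(dest)                       # group tokens by destination
    x_sorted = x[order]

    # Exchange per-rank token counts so every rank knows how much it will receive.
    send_counts = torch.bincount(dest, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # First all-to-all: scatter tokens to the ranks that host their experts.
    recv_buf = x.new_empty(int(recv_counts.sum()), x.shape[1])
    dist.all_to_all_single(recv_buf, x_sorted,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    hidden = local_expert(recv_buf)                   # run this rank's expert

    # Second all-to-all: return processed tokens to their original ranks.
    back = torch.empty_like(x_sorted)
    dist.all_to_all_single(back, hidden,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    out = torch.empty_like(x)
    out[order] = back                                 # restore original token order
    return out
```

A production system layers further optimizations on top of this basic pattern, for example parallelizing the non-expert parameters and overlapping or fusing the communication with compute.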

Implications and Future Directions

This comprehensive approach to improving MoE models for both training and inference could set the stage for next-generation AI scalability. With systems like DeepSpeed-MoE, larger and higher-quality models can be developed and deployed using fewer computational resources, broadening the horizons for AI research and application. It also moves the field toward more efficient and economical alternatives, shifting emphasis in the large-model landscape from dense to sparse MoE models.

The research, code, and tutorials for DeepSpeed-MoE are available online, and the experiments were conducted on the Microsoft Azure AI platform, inviting wider community participation in advancing this domain. The improved parameter efficiency, scale, and reduced inference cost presented in this work mark a significant step toward operationalizing massive MoE models for practical use, promising advances in AI capability without a corresponding increase in computational demand.
