HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System

Published 28 Mar 2022 in cs.DC | (2203.14685v3)

Abstract: As giant dense models advance quality but require large GPU budgets for training, the sparsely gated Mixture-of-Experts (MoE), a kind of conditional computation architecture, is proposed to scale models while keeping their computation constant. Specifically, the input tokens are routed by the gate network and activate only part of the expert network. Existing MoE training systems support training of only some mainstream MoE models (e.g., Top-k) and assume expensive high-bandwidth GPU clusters. In this paper, we present HetuMoE, a high-performance large-scale sparse MoE training system built on Hetu. HetuMoE provides multiple gating strategies and efficient GPU kernel implementations. To further improve the training efficiency on commodity GPU clusters (e.g., with only 1 NIC), we introduce the hierarchical AllToAll communication that combines hierarchical networks and aggregating messages. Compared with existing state-of-the-art MoE systems, HetuMoE obtains at least 15% speedup. Specifically, HetuMoE outperforms DeepSpeed-MoE by up to 8.1x under the switch gate with a batch size of 32. Our code is available at: https://github.com/PKU-DAIR/Hetu.


Citations (28)


Summary

The paper introduces HetuMoE, which supports diverse gating strategies to enhance flexibility in MoE training.
The paper proposes a hierarchical All-To-All communication method that significantly reduces network bottlenecks.
The paper reports an average 25% speedup of its optimized Top-k GPU kernel over the standard PyTorch implementation, and at least a 15% end-to-end speedup over existing MoE systems.

An Examination of HetuMoE: Advancements in Trillion-scale Mixture-of-Expert Distributed Training Systems

The paper "HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System" presents a novel system for training large-scale sparsely gated Mixture-of-Experts (MoE) models more efficiently on commodity GPU clusters. The proposed system, HetuMoE, is built on the Hetu deep learning framework and addresses key challenges in MoE systems related to communication bottlenecks and support for various gating strategies.
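To make the sparse gating idea concrete, the following is a minimal pure-Python sketch of Top-k routing. It is not HetuMoE's kernel or API: the function names, the toy logits, and the choice to renormalize gate weights over the selected experts are all assumptions made for illustration. Capacity limits and load-balancing losses, which real Switch and GShard gates use, are left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: weights are renormalized over the chosen experts so they
    sum to 1. Real gates differ in how they weight the selected experts.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def route(token_logits, k):
    """Group (token_index, weight) pairs by the expert that will process them."""
    assignment = {}
    for token, logits in enumerate(token_logits):
        for expert, weight in top_k_gate(logits, k):
            assignment.setdefault(expert, []).append((token, weight))
    return assignment

# Hypothetical gate logits: 3 tokens scored against 4 experts.
token_logits = [
    [2.0, 0.1, 0.3, -1.0],
    [0.0, 1.5, 1.4, 0.2],
    [-0.5, 0.0, 0.1, 3.0],
]
switch_routing = route(token_logits, k=1)  # Switch-style: one expert per token
gshard_routing = route(token_logits, k=2)  # GShard-style: two experts per token
```

With k=1 each token reaches exactly one expert, so per-token computation stays constant no matter how many experts exist. With k=2 the same tokens spread over more experts, which raises the volume of data that must be exchanged between devices.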

Key Contributions and Methodologies

The study delivers multiple contributions to the field of distributed training systems, particularly focusing on MoE architectures:

Comprehensive Support for Gating Strategies: Unlike existing MoE frameworks that provide limited gating options, HetuMoE encompasses a broad spectrum of gating strategies, including Switch, GShard, M6, BASE Layer, Hash Layer, SAM, and Dense-to-Sparse. This versatility facilitates the exploration and deployment of MoE models with different operational characteristics and requirements.
Hierarchical All-To-All Communication: One of the primary bottlenecks of MoE training on distributed systems is communication overhead, especially on commodity clusters where each node has limited network bandwidth (e.g., a single NIC). HetuMoE introduces a hierarchical All-To-All communication pattern that exploits the network hierarchy and aggregates messages, so that intra-node and inter-node bandwidth are both used efficiently and network congestion is reduced. The paper presents this as the main lever for improving training efficiency when scaling across multiple nodes with modest networking setups.
Optimized GPU Kernel Implementations: The paper outlines specific optimizations in GPU kernel implementation, notably in the Top-k operations crucial for MoE’s gating networks. Through these tailored optimizations, HetuMoE achieves a marked reduction in computational overhead compared with standard PyTorch implementations, exhibiting an average speed improvement of 25%.
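The message-aggregation idea behind the hierarchical All-To-All can be illustrated with a small pure-Python simulation. This is a sketch under stated assumptions, not HetuMoE's implementation: it assumes one leader per node that gathers all outgoing payloads, exchanges a single bundled message with every other node's leader, and then scatters the received payloads locally. All function and variable names are hypothetical.

```python
def flat_all_to_all(send):
    """Reference result: rank dst receives send[src][dst] from every src."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]

def flat_inter_node_messages(num_nodes, gpus_per_node):
    """Each rank sends one small message to every rank on every other node."""
    world = num_nodes * gpus_per_node
    return world * (world - gpus_per_node)

def hierarchical_all_to_all(send, num_nodes, gpus_per_node):
    """Simulate gather -> leader exchange -> scatter (assumed three-step layout)."""
    world = num_nodes * gpus_per_node
    # Step 1 (intra-node gather): bundle payloads by (source node, destination node).
    outbox = [[[] for _ in range(num_nodes)] for _ in range(num_nodes)]
    for src in range(world):
        for dst in range(world):
            bundle = outbox[src // gpus_per_node][dst // gpus_per_node]
            bundle.append((src, dst, send[src][dst]))
    # Step 2 (inter-node exchange): one aggregated message per ordered node pair.
    inbox = [[outbox[s][d] for s in range(num_nodes)] for d in range(num_nodes)]
    inter_node_messages = num_nodes * (num_nodes - 1)
    # Step 3 (intra-node scatter): deliver each payload to its destination rank.
    recv = [[None] * world for _ in range(world)]
    for dst_node in range(num_nodes):
        for bundle in inbox[dst_node]:
            for src, dst, payload in bundle:
                recv[dst][src] = payload
    return recv, inter_node_messages

# Hypothetical cluster: 2 nodes with 4 GPUs each.
NUM_NODES, GPUS_PER_NODE = 2, 4
WORLD = NUM_NODES * GPUS_PER_NODE
send = [[f"{src}->{dst}" for dst in range(WORLD)] for src in range(WORLD)]

recv, hier_msgs = hierarchical_all_to_all(send, NUM_NODES, GPUS_PER_NODE)
flat_msgs = flat_inter_node_messages(NUM_NODES, GPUS_PER_NODE)
```

In this toy setup the flat exchange issues 32 small inter-node messages while the hierarchical version issues 2 larger ones, each carrying 16 payloads, and both deliver identical data. Fewer, larger messages tend to use a bandwidth-limited link more efficiently, which is the intuition the simulation is meant to convey. It does not model actual transfer time.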

Experimental Results

The efficacy of HetuMoE is validated through comparisons against leading MoE systems such as DeepSpeed-MoE and FastMoE. Under the Switch and GShard gates, HetuMoE is consistently faster, achieving at least a 15% speedup across different batch sizes. Its advantage peaks at up to 8.1x over DeepSpeed-MoE when using the Switch gate with a batch size of 32. This positions HetuMoE as both a versatile and a highly efficient solution for large-scale model training.

Implications and Future Directions

The development of HetuMoE carries both theoretical and practical implications. By supporting diverse gating strategies and optimizing communication, HetuMoE makes MoE models more practical across varying hardware configurations and reduces their dependence on high-speed, high-cost infrastructure. This broadens access to MoE architectures, potentially accelerating research and deployment in natural language processing and computer vision.

Looking forward, several potential research directions are apparent. First, expanding and refining the hierarchical communication strategies and exploring their integration with emergent networking technologies could further diminish latency and improve the scalability of distributed training systems. Moreover, investigating adaptive learning mechanisms that dynamically modulate gating strategies in response to dataset characteristics or resource availability could enhance model performance and efficiency. Lastly, extending these findings into real-world applications and evaluating the adaptability of HetuMoE in heterogeneous environments remains an exciting avenue for future work.

Overall, HetuMoE represents a significant stride in enhancing the efficiency and accessibility of large-scale distributed training systems, aligning the ongoing advancement of AI models with practical deployment capabilities.
