Emergent Mind

Abstract

With the increasing diversity of ML infrastructures, distributed training over heterogeneous computing systems is desirable for producing big models. Mixture-of-Experts (MoE) models have been proposed to lower training cost relative to the overall size of models and data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can be further improved in several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE, which proposes Elastic MoE training with 2D prefetch and fusion communication over hierarchical storage, so as to exploit efficient parallelism of various types. For scalable inference on a single node, especially when the model is larger than GPU memory, SE-MoE forms CPU and GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE, which successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. Comparison against the state of the art shows that SE-MoE outperforms DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks, e.g., UFO, SE-MoE achieves 64% higher throughput with an 18% lower memory footprint. The code of the framework will be released at: https://github.com/PaddlePaddle/Paddle.

Overview

  • SE-MoE improves efficiency and scalability in training/inference of MoE models, maximizing computational resources.

  • The framework employs Elastic MoE training for better load balancing and reduces communication overhead through 2D prefetch scheduling and fused communication over hierarchical storage.

  • SE-MoE uses a unified CPU-GPU memory ring for processing models exceeding GPU memory limits, enhancing inference efficiency.

  • Empirical tests show that SE-MoE outperforms DeepSpeed, training a 12-billion-parameter UFO model in 8 days on 48 A100 GPUs and delivering higher throughput in both training and inference.

  • The advancements signify a step towards training larger models sustainably, impacting future machine learning infrastructure.

Introduction

The paper introduces SE-MoE, a framework designed to improve the efficiency and scalability of distributed training and inference of Mixture-of-Experts (MoE) models. MoE models make it possible to train larger models within the constraints of limited computational resources by activating only a subset of parameters for each input. DeepSpeed has made strides in this area, but the paper argues that further improvements are possible, particularly in load balancing, communication and computation efficiency, and memory storage limitations.
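
To make the gating idea concrete, below is a minimal NumPy sketch of sparsely gated top-2 routing. The function name top2_gating and the toy experts are illustrative assumptions, not SE-MoE's PaddlePaddle implementation; the point is only that each token activates a small subset of expert parameters.

    import numpy as np

    def top2_gating(x, w_gate, experts):
        """Minimal sketch of sparsely gated MoE routing (top-2), not the paper's
        implementation: each token is sent only to the two experts with the
        highest gate scores, so most parameters stay idle for that token."""
        logits = x @ w_gate                            # [tokens, num_experts]
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)          # softmax gate scores
        top2 = np.argsort(probs, axis=-1)[:, -2:]      # indices of the 2 best experts
        y = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in top2[t]:
                y[t] += probs[t, e] * experts[e](x[t]) # weighted expert outputs
        return y

    # toy usage: 4 tokens, hidden size 8, 4 experts (random linear layers)
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    w_gate = rng.normal(size=(8, 4))
    experts = [lambda v, W=rng.normal(size=(8, 8)): v @ W for _ in range(4)]
    print(top2_gating(x, w_gate, experts).shape)       # (4, 8)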

Enhancing MoE Training and Inference

SE-MoE addresses several challenges in the realm of MoE models. It uses Elastic MoE training to improve load balancing and communication through 2D prefetch scheduling and fused communication, an approach that enhances parallelism during training and extends across hierarchical storage. For inference, particularly for models that exceed GPU memory capacity, SE-MoE forms CPU and GPU memory into a ring of sections and cycles computation through those sections in a round-robin manner, circumventing the memory constraints imposed by a single GPU.
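
The ring-style offload can be illustrated with a short, simplified Python sketch. The class name RingOffloadRunner, the gpu_slots parameter, and the placeholder load/evict logic are assumptions introduced here for exposition; SE-MoE's real scheduler overlaps host-device transfers with computation, which is omitted.

    from collections import deque

    class RingOffloadRunner:
        """Illustrative sketch of ring-style CPU/GPU offload inference (not
        SE-MoE's actual scheduler): parameters are split into sections, most
        of them parked in host memory, and a small window is kept resident on
        the device. Each step the oldest resident section is evicted and the
        next one is loaded, so compute cycles through the ring round-robin."""

        def __init__(self, sections, gpu_slots=2):
            self.sections = sections        # list of (name, params) in layer order
            self.gpu_slots = gpu_slots      # how many sections fit in device memory
            self.resident = deque()         # sections currently "on the GPU"

        def _load(self, section):           # stand-in for a host-to-device copy
            self.resident.append(section)
            if len(self.resident) > self.gpu_slots:
                self.resident.popleft()     # evict the oldest section back to host

        def run(self, x, compute):
            for section in self.sections:   # round-robin over the ring of sections
                self._load(section)
                x = compute(section, x)     # compute only with resident parameters
            return x

    # toy usage: 6 "sections" that each add their stored offset to the activation
    runner = RingOffloadRunner([(f"section_{i}", i) for i in range(6)], gpu_slots=2)
    print(runner.run(0, lambda sec, x: x + sec[1]))    # 0+1+2+3+4+5 = 15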

Empirical Verification

Through extensive experimentation, SE-MoE is shown to outperform existing systems such as DeepSpeed. It successfully trained an MoE-based Unified Feature Optimization (UFO) model with 12 billion parameters in 8 days on 48 A100 GPUs while achieving considerably higher throughput in both training and inference. Notably, under unbalanced workloads, a common scenario in multi-task learning, SE-MoE delivered 64% higher throughput with an 18% lower memory footprint.

Future Perspectives

This paper's contributions to MoE training and inference mark a notable advance in machine learning infrastructure, pointing future work toward more efficient, resource-aware, and scalable MoE systems. The SE-MoE framework, which will be publicly released, makes training extraordinarily large models more feasible while keeping energy efficiency and environmental impact in view. Further optimization along these lines should strengthen the position of sparsely activated networks across a variety of machine learning tasks, pushing the boundaries of current models in size, speed, and efficiency.
