Emergent Mind

Abstract

With the increasing diversity of ML infrastructures, distributed training over heterogeneous computing systems is desirable for producing big models. Mixture-of-Experts (MoE) models have been proposed to lower training cost relative to the overall size of models and data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can be further improved in several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE, which proposes Elastic MoE training with 2D prefetch and fusion communication over hierarchical storage, so as to exploit efficient parallelism of various types. For scalable inference on a single node, especially when the model is larger than GPU memory, SE-MoE forms CPU and GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE, which successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. Comparison against the state of the art shows that SE-MoE outperforms DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks, e.g., UFO, SE-MoE achieves 64% higher throughput with an 18% lower memory footprint. The code of the framework will be released at: https://github.com/PaddlePaddle/Paddle.

Overview

  • SE-MoE improves efficiency and scalability in training/inference of MoE models, maximizing computational resources.

  • The framework employs Elastic MoE training for better load balancing and reduces communication overhead through 2D prefetch scheduling and fused communication over hierarchical storage.

  • SE-MoE uses a unified CPU-GPU memory ring for processing models exceeding GPU memory limits, enhancing inference efficiency.

  • Empirical tests show that SE-MoE outperforms DeepSpeed, training a 12-billion-parameter UFO model in 8 days on 48 A100 GPUs and delivering higher throughput in both training and inference.

  • The advancements signify a step towards training larger models sustainably, impacting future machine learning infrastructure.

Introduction

The paper introduces SE-MoE, a framework designed to improve the efficiency and scalability of distributed training and inference of Mixture-of-Experts (MoE) models. MoE models make it possible to train larger models within the constraints of limited computational resources by activating only a subset of parameters for each input. DeepSpeed has made strides in this area, but the paper argues that further improvements are possible, particularly in load balancing, communication and computation efficiency, and memory storage limitations.
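
To make the gating idea concrete, below is a minimal NumPy sketch of sparsely gated top-2 routing. The function name top2_gating and the toy experts are illustrative assumptions, not SE-MoE's PaddlePaddle implementation; the point is only that each token activates a small subset of expert parameters.

    import numpy as np

    def top2_gating(x, w_gate, experts):
        """Minimal sketch of sparsely gated MoE routing (top-2), not the paper's
        implementation: each token is sent only to the two experts with the
        highest gate scores, so most parameters stay idle for that token."""
        logits = x @ w_gate                            # [tokens, num_experts]
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)          # softmax gate scores
        top2 = np.argsort(probs, axis=-1)[:, -2:]      # indices of the 2 best experts
        y = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in top2[t]:
                y[t] += probs[t, e] * experts[e](x[t]) # weighted expert outputs
        return y

    # toy usage: 4 tokens, hidden size 8, 4 experts (random linear layers)
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    w_gate = rng.normal(size=(8, 4))
    experts = [lambda v, W=rng.normal(size=(8, 8)): v @ W for _ in range(4)]
    print(top2_gating(x, w_gate, experts).shape)       # (4, 8)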

Enhancing MoE Training and Inference

SE-MoE addresses several challenges in the realm of MoE models. It uses Elastic MoE training to improve load balancing and communication through 2D prefetch scheduling and fused communication, an approach that enhances parallelism during training and extends across hierarchical storage. For inference, particularly for models that exceed GPU memory capacity, SE-MoE forms CPU and GPU memory into a ring of sections and cycles computation through those sections in a round-robin manner, circumventing the memory constraints imposed by a single GPU.
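
The ring-style offload can be illustrated with a short, simplified Python sketch. The class name RingOffloadRunner, the gpu_slots parameter, and the placeholder load/evict logic are assumptions introduced here for exposition; SE-MoE's real scheduler overlaps host-device transfers with computation, which is omitted.

    from collections import deque

    class RingOffloadRunner:
        """Illustrative sketch of ring-style CPU/GPU offload inference (not
        SE-MoE's actual scheduler): parameters are split into sections, most
        of them parked in host memory, and a small window is kept resident on
        the device. Each step the oldest resident section is evicted and the
        next one is loaded, so compute cycles through the ring round-robin."""

        def __init__(self, sections, gpu_slots=2):
            self.sections = sections        # list of (name, params) in layer order
            self.gpu_slots = gpu_slots      # how many sections fit in device memory
            self.resident = deque()         # sections currently "on the GPU"

        def _load(self, section):           # stand-in for a host-to-device copy
            self.resident.append(section)
            if len(self.resident) > self.gpu_slots:
                self.resident.popleft()     # evict the oldest section back to host

        def run(self, x, compute):
            for section in self.sections:   # round-robin over the ring of sections
                self._load(section)
                x = compute(section, x)     # compute only with resident parameters
            return x

    # toy usage: 6 "sections" that each add their stored offset to the activation
    runner = RingOffloadRunner([(f"section_{i}", i) for i in range(6)], gpu_slots=2)
    print(runner.run(0, lambda sec, x: x + sec[1]))    # 0+1+2+3+4+5 = 15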

Empirical Verification

Through extensive experimentation, SE-MoE is shown to outperform existing systems such as DeepSpeed. It successfully trained an MoE-based Unified Feature Optimization (UFO) model with 12 billion parameters in 8 days on 48 A100 GPUs while achieving considerably higher throughput in both training and inference. Notably, under unbalanced workloads, a common scenario in multi-task learning, SE-MoE delivered 64% higher throughput with an 18% lower memory footprint.

Future Perspectives

This paper's contributions to MoE training and inference mark a notable advance in machine learning infrastructure, pointing future work toward more efficient, resource-aware, and scalable MoE systems. The SE-MoE framework, which will be publicly released, makes training extraordinarily large models more feasible while keeping energy efficiency and environmental impact in view. Further optimization along these lines should strengthen the position of sparsely activated networks across a variety of machine learning tasks, pushing the boundaries of current models in size, speed, and efficiency.
