
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

(2402.15627)
Published Feb 23, 2024 in cs.LG and cs.DC

Abstract

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training LLMs at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Overview

  • MegaScale introduces advancements in training LLMs efficiently and stably on over 10,000 GPUs, focusing on full-stack optimization.

  • The system employs the Parallel Transformer Block and Sliding Window Attention, the LAMB Optimizer, and mixed parallelism strategies to improve throughput without sacrificing model accuracy.

  • To ensure stability at scale, MegaScale integrates Automated Diagnostic and Recovery Mechanisms and provides In-Depth Observability Tools for real-time performance monitoring.

  • Operational experience with MegaScale has demonstrated significant efficiency improvements and stability in long-term runs, and has yielded insights for future AI systems research.

Scaling Large Language Model Training with MegaScale: Achievements at 10,000 GPU Scale

Introduction

MegaScale represents a significant advance in LLM training, maximizing training efficiency and stability on an architecture that scales beyond 10,000 GPUs. Through a comprehensive design and implementation effort, the system addresses the dual challenges of achieving high training efficiency and maintaining stability over the extended training periods typical of LLMs.

Design Principles and System Overview

MegaScale takes a full-stack approach, optimizing across multiple axes including model block and optimizer design, computation-communication overlapping, and network performance tuning. Central to its design philosophy are algorithm-system co-design and in-depth observability, which enable optimizations spanning the entire system stack and provide both the efficiency and the robustness required for large-scale deployments.

Algorithmic and System-Level Optimizations

The system introduces several key innovations:

  • Parallel Transformer Block and Sliding Window Attention techniques are adopted to make the model architecture more efficient without sacrificing accuracy (a minimal sketch of the parallel block follows this list).
  • LAMB Optimizer adjustments allow the batch size to be scaled up significantly, improving throughput and reducing pipeline bubbles, a critical factor in large-scale model training.
  • Mixed Parallelism Strategies are utilized to strike an optimal balance between data parallelism, pipeline parallelism, tensor parallelism, and sequence parallelism, ensuring maximum hardware utilization.
  • Advanced Communication Overlapping Techniques are deployed to hide the latency of the heavy communication inherent in distributed LLM training, significantly improving Model FLOPs Utilization (MFU); an asynchronous-collective sketch also follows this list.
  • Custom Network Topology and Performance Tuning are undertaken to address the unique network performance challenges presented by the scale of the deployment.
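
As a rough illustration of the parallel transformer block mentioned above: instead of applying attention and then the MLP sequentially, both sub-layers consume the same layer-normalized input and their outputs are summed into the residual stream, which removes one sequential dependency per layer. The PyTorch-style sketch below is a minimal rendering of that structure with illustrative module choices; it is not MegaScale's actual implementation, and causal masking, dropout, and sliding-window attention are omitted for brevity.

```python
# Minimal sketch of a parallel transformer block: attention and the MLP read
# the same LayerNorm output and their results are summed with the residual,
# so the two branches have no sequential dependency on each other.
# Dimensions and module choices are illustrative, not taken from MegaScale.
import torch
import torch.nn as nn

class ParallelTransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                                      # single shared LayerNorm
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # attention branch
        mlp_out = self.mlp(h)                                 # MLP branch, independent of attn_out
        return x + attn_out + mlp_out                         # residual sum of both branches
```

The communication-overlapping idea can be sketched in a similarly generic way using PyTorch's asynchronous collectives: a collective is launched with async_op=True, independent computation proceeds while it is in flight, and the returned handle is waited on only when the gathered result is needed. This assumes an already-initialized process group and shows a general pattern, not the specific overlapping scheme described in the paper.

```python
# Generic overlap of communication with computation via torch.distributed's
# asynchronous API. Assumes dist.init_process_group() has already been called;
# tensor shapes and the placeholder matmul are illustrative.
import torch
import torch.distributed as dist

def overlapped_step(local_shard: torch.Tensor, other_input: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(local_shard) for _ in range(world_size)]
    handle = dist.all_gather(gathered, local_shard, async_op=True)  # non-blocking launch

    other_out = other_input @ other_input.T  # independent compute overlaps with the collective

    handle.wait()                            # synchronize before consuming the gathered shards
    full = torch.cat(gathered, dim=0)
    return full.sum() + other_out.sum()
```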

Stability and Fault Tolerance

For stability and fault tolerance, MegaScale provides a robust training framework suited to the demands of LLM training at scale:

  • The introduction of Automated Diagnostic and Recovery Mechanisms ensures that the system can identify, diagnose, and recover from a wide array of faults with minimal intervention, maintaining high levels of effective training time.
  • In-Depth Observability Tools have been developed to provide granular insight into system performance and behavior, enabling rapid identification and resolution of both anticipated and unforeseen issues (a simplified straggler-detection sketch follows this list).
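
As one concrete, simplified flavor of what such diagnosis can look like, the sketch below flags ranks whose per-step time sits well above the group median. The threshold and the example timings are hypothetical; this only illustrates the pattern, not MegaScale's actual tooling.

```python
# Illustrative straggler check: each rank is assumed to report its per-step
# time, and ranks far above the group median are flagged for deeper diagnosis.
# The 10% tolerance and the example timings below are hypothetical.
import statistics

def find_stragglers(step_times_ms: dict[int, float], tolerance: float = 1.10) -> list[int]:
    """Return ranks whose step time exceeds the median by more than `tolerance`x."""
    median = statistics.median(step_times_ms.values())
    return [rank for rank, t in step_times_ms.items() if t > tolerance * median]

# Example: rank 7 is ~15% slower than its peers and gets flagged.
times = {0: 102.0, 1: 101.5, 2: 103.2, 3: 101.8, 7: 117.4}
print(find_stragglers(times))  # -> [7]
```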

Performance and Operational Experience

MegaScale's design and optimizations have led to notable practical achievements in the training of LLMs:

  • Efficiency Improvement: In comparative benchmarks, MegaScale achieved 55.2% MFU when training a 175-billion-parameter model across 12,288 GPUs, a 1.34× improvement over Megatron-LM (a back-of-the-envelope MFU calculation follows this list).

  • Stability in Long-Term Runs: Real-world deployment scenarios demonstrate the system's capability to maintain model convergence and effectively manage faults over extended periods, showcasing the maturity of its fault tolerance mechanisms.
  • Operational Insights: The system's operational deployment yielded valuable insights, particularly concerning the diagnosis and resolution of computational stragglers and network performance issues, underscoring the practical benefits of its diagnostic tools and robust training framework.
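
For context on the MFU figure above: MFU is the ratio of the model FLOPs actually delivered per second to the aggregate peak FLOPs of the cluster, with training FLOPs commonly approximated as 6 × parameters × tokens processed. The sketch below applies that approximation; the throughput and per-GPU peak numbers are hypothetical placeholders, not values reported in the paper.

```python
# Back-of-the-envelope MFU estimate: achieved model FLOPs/s divided by the
# cluster-wide peak FLOPs/s, using the common 6 * params * tokens approximation
# for training FLOPs (attention terms ignored). All inputs below are
# hypothetical placeholders, not figures from the paper.

def mfu(params: float, tokens_per_sec: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved_flops = 6 * params * tokens_per_sec  # model FLOPs delivered per second
    peak_flops = n_gpus * peak_flops_per_gpu      # aggregate hardware peak
    return achieved_flops / peak_flops

# Hypothetical example: a 175B-parameter model on 12,288 GPUs.
print(f"MFU ~= {mfu(175e9, 2.0e6, 12_288, 312e12):.1%}")
```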

Implications and Future Directions

The achievements of MegaScale represent a significant step forward in AI systems research, providing a scalable, efficient, and robust framework for developing next-generation AI models. The experiences and insights from the project also highlight areas for future work, particularly fault diagnosis and recovery in very large distributed systems, further optimization of communication strategies, and continued innovation in model and optimizer design.

With the ongoing rapid evolution of LLMs and their applications, MegaScale not only sets new benchmarks for large-scale model training but also opens up pathways for future advancements in AI systems design and implementation.
