Zero Bubble Pipeline Parallelism (2401.10241v1)
Abstract: Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles, which were long deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge, is the first to achieve zero pipeline bubbles under synchronous training semantics. The key idea behind this improvement is to split the backward computation into two parts: one that computes the gradient for the input and another that computes the gradient for the parameters. Based on this idea, we handcraft novel pipeline schedules that significantly outperform the baseline methods. We further develop an algorithm that automatically finds an optimal schedule for a given model configuration and memory limit. Additionally, to truly achieve zero bubbles, we introduce a novel technique to bypass synchronizations during the optimizer step. Experimental evaluations show that our method outperforms the 1F1B schedule by up to 23% in throughput under a similar memory limit; this number can be pushed further to 31% when the memory constraint is relaxed. We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism. We have open-sourced our implementation, built on the popular Megatron-LM repository, at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
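To make the key idea concrete, below is a minimal illustrative sketch (not the paper's Megatron-LM implementation) of how the backward pass of a linear layer factors into the two halves the schedule exploits: the input-gradient computation ("B"), which sits on the pipeline's critical path because the previous stage is waiting for it, and the weight-gradient computation ("W"), which depends only on locally saved tensors and can therefore be deferred to fill pipeline bubbles. Under synchronous semantics this reordering is safe, since the parameter gradient is not consumed until the optimizer step. The function names and shapes here are assumptions chosen for illustration.

```python
import torch

# For a linear layer y = x @ W.T, the backward pass factors into two
# independent matmuls, so the two halves can be scheduled separately.

def backward_input(grad_y, weight):
    # "B": gradient w.r.t. the input. The previous pipeline stage cannot
    # start its own backward until this result is sent to it.
    return grad_y @ weight

def backward_weight(grad_y, x):
    # "W": gradient w.r.t. the parameters. It reads only tensors saved
    # locally during the forward pass, so it can run later to fill bubbles.
    return grad_y.T @ x

x = torch.randn(4, 8)        # activation saved from the forward pass
weight = torch.randn(16, 8)  # layer parameters
grad_y = torch.randn(4, 16)  # gradient arriving from the next stage

grad_x = backward_input(grad_y, weight)  # propagate immediately (B)
grad_w = backward_weight(grad_y, x)      # defer until a bubble opens (W)
```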
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, 2018.
- DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 431–445, 2021.
- CBC user guide. In Emerging Theory, Methods, and Applications, pp. 257–277. INFORMS, 2005.
- Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377, 2018.
- GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
- Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.
- MLIR: A compiler infrastructure for the end of Moore's law. arXiv preprint arXiv:2002.11054, 2020.
- PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
- Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15, 2021.
- On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. PMLR, 2013.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020.
- Relay: A new IR for machine learning frameworks. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 58–68, 2018.
- Amit Sabne. XLA: Compiling machine learning for peak performance, 2020.
- Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 10–19, 2019.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- PipeMare: Asynchronous pipeline parallel DNN training. Proceedings of Machine Learning and Systems, 3:269–296, 2021.
- Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 559–578, 2022.