
Zero Bubble Pipeline Parallelism (2401.10241v1)

Published 30 Nov 2023 in cs.DC, cs.AI, and cs.LG

Abstract: Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge, is the first to successfully achieve zero pipeline bubbles under synchronous training semantics. The key idea behind this improvement is to split the backward computation into two parts, one that computes gradient for the input and another that computes for the parameters. Based on this idea, we handcraft novel pipeline schedules that significantly outperform the baseline methods. We further develop an algorithm that automatically finds an optimal schedule based on specific model configuration and memory limit. Additionally, to truly achieve zero bubble, we introduce a novel technique to bypass synchronizations during the optimizer step. Experimental evaluations show that our method outperforms the 1F1B schedule up to 23% in throughput under a similar memory limit. This number can be further pushed to 31% when the memory constraint is relaxed. We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism. We open sourced our implementation based on the popular Megatron-LM repository on https://github.com/sail-sg/zero-bubble-pipeline-parallelism.


Summary

  • The paper demonstrates a novel scheduling method that splits backward computation to eliminate pipeline bubbles in distributed training.
  • It introduces both handcrafted and automated scheduling techniques using integer linear programming and heuristic strategies.
  • Experimental results show up to 23% higher throughput than the 1F1B baseline under a similar memory limit, rising to 31% when the memory constraint is relaxed.

Zero Bubble Pipeline Parallelism

Introduction

The paper "Zero Bubble Pipeline Parallelism" addresses a major challenge in distributed model training: the inefficiency caused by pipeline bubbles in pipeline parallelism (PP). As the complexity and size of neural networks increase, distributed training involving multiple GPUs has become essential. Traditional methods like data parallelism (DP) and model parallelism, including tensor parallelism (TP) and PP, have been optimized to various extents. However, pipeline bubbles—idle times during execution due to dependencies between stages—persist as a bottleneck. This research proposes a novel scheduling strategy that eliminates these bubbles while maintaining synchronous training semantics.

Key Concepts and Methodology

The main innovation of this paper is a scheduling strategy that eliminates pipeline bubbles. The strategy involves splitting the backward computation into two parts: one for computing gradients with respect to the input, and the other for parameters. This split allows the researchers to design novel pipeline schedules that are highly efficient. The paper introduces both handcrafted and automated scheduling techniques tailored to specific model configurations and memory limitations.
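To make the split concrete, the following sketch shows the two halves of the backward pass for a simple linear layer y = x @ w (a minimal PyTorch illustration, not the paper's Megatron-LM implementation; the function names and shapes are hypothetical). The input gradient (B) sits on the pipeline's critical path because the upstream stage is waiting for it, whereas the weight gradient (W) has no inter-stage dependency and can be deferred to fill bubbles.

```python
import torch

# B step: gradient w.r.t. the layer input. The previous pipeline stage needs this
# tensor before it can run its own backward, so it is on the critical path.
def backward_input(grad_output: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return grad_output @ weight.t()

# W step: gradient w.r.t. the layer weight. No other stage depends on it, so it
# can be postponed and scheduled into slots that would otherwise be bubbles.
def backward_weight(grad_output: torch.Tensor, saved_input: torch.Tensor) -> torch.Tensor:
    return saved_input.t() @ grad_output

# Hypothetical shapes: 4 tokens, hidden sizes 8 -> 16, for y = x @ w.
x = torch.randn(4, 8)
w = torch.randn(8, 16)
grad_y = torch.randn(4, 16)

dx = backward_input(grad_y, w)   # sent to the upstream stage immediately
dw = backward_weight(grad_y, x)  # deferred; accumulated into the weight gradient later
```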

Handcrafted Schedules: The authors first present two handcrafted pipeline schedules, shown in Figure 1. The first, ZB-H1, keeps peak memory usage comparable to the conventional 1F1B schedule while substantially reducing the bubble size. The second, ZB-H2, eliminates bubbles entirely at the cost of higher memory consumption. A simplified sketch of this W-deferral idea is given below Figure 1.

Figure 1: Handcrafted pipeline schedules, top: ZB-H1; bottom: ZB-H2.
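The sketch below is a simplified, hypothetical construction in the spirit of ZB-H1, not the paper's exact schedule: keep a 1F1B-style warmup and steady state, emit the input-gradient backward (B) as soon as its dependency is satisfied, and drain the deferred weight-gradient operations (W) in the cooldown phase where 1F1B would otherwise idle.

```python
def zbh1_like_order(num_stages: int, num_microbatches: int, stage: int):
    """Per-stage instruction order illustrating W deferral (simplified sketch).

    'F' = forward, 'B' = input-gradient backward, 'W' = weight-gradient
    computation. Deferred W ops fill the tail where 1F1B would idle.
    """
    order = []
    warmup = min(num_stages - stage - 1, num_microbatches)  # 1F1B-style warmup
    for mb in range(warmup):
        order.append(("F", mb))
    pending_w = []
    # Steady state: one forward, then the input-gradient backward of an earlier
    # microbatch; its weight gradient is deferred instead of computed in place.
    for mb in range(warmup, num_microbatches):
        order.append(("F", mb))
        order.append(("B", mb - warmup))
        pending_w.append(mb - warmup)
    # Cooldown: remaining backwards, interleaved with deferred weight gradients
    # that occupy slots which would otherwise be pipeline bubbles.
    for mb in range(num_microbatches - warmup, num_microbatches):
        order.append(("B", mb))
        if pending_w:
            order.append(("W", pending_w.pop(0)))
        pending_w.append(mb)
    while pending_w:
        order.append(("W", pending_w.pop(0)))
    return order

# Example: 4 stages, 8 microbatches, first stage (stage index 0).
print(zbh1_like_order(num_stages=4, num_microbatches=8, stage=0))
```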

Automatic Scheduling: The paper further develops an automatic scheduling algorithm, leveraging integer linear programming and heuristic methods to optimize scheduling under realistic conditions. This algorithm considers the execution time and memory constraints, achieving near-zero bubble rates while maintaining computational efficiency.
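The toy function below is only a hedged sketch of the heuristic intuition, not the paper's ILP formulation or its actual algorithm: prefer F and B because they unblock neighbouring stages, fall back to a deferred W when the stage would otherwise idle, and pull a W forward when activation memory approaches the limit.

```python
from collections import deque

def pick_next_op(ready_f: deque, ready_b: deque, deferred_w: deque,
                 live_activations: int, mem_limit: int):
    """Greedy choice of the next operation for one stage (toy sketch).

    Each queue holds microbatch ids. Finishing a W op releases the activations
    kept for the weight-gradient computation, so it also acts as a memory
    release valve when the activation budget is nearly exhausted.
    """
    if live_activations >= mem_limit and deferred_w:
        return ("W", deferred_w.popleft())   # relieve memory pressure first
    if ready_b:
        return ("B", ready_b.popleft())      # B unblocks the upstream stage
    if ready_f:
        return ("F", ready_f.popleft())      # F unblocks the downstream stage
    if deferred_w:
        return ("W", deferred_w.popleft())   # fill a would-be bubble with W
    return None                              # nothing runnable: a real bubble
```

Prioritising B over F in this sketch keeps the number of in-flight activations bounded, the same reason 1F1B alternates the two in its steady state.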

Experimental Results and Implications

The method was evaluated against baseline methods such as 1F1B and 1F1B-I, showing significant improvements in throughput. For instance, the proposed ZB-2p schedule outperforms traditional 1F1B scheduling by up to 31% when memory limits are relaxed. The experiments were conducted on a variety of model sizes, demonstrating the scalability of the approach.

Figure 2: ZB-V schedule.

The zero bubble schedules present clear benefits in terms of throughput and memory usage. The "Zero Bubble V" (ZB-V) schedule, depicted in Figure 2, achieves throughput comparable to ZB-2p while using significantly less memory, making it a practical choice for resource-constrained environments; Figure 3 relates its memory limit to the resulting bubble rate.

Figure 3: The relation between memory limit and bubble rate for ZB-V, compared with the heuristic method.

Conclusion

The introduction of a zero-bubble scheduling strategy marks a substantial advancement in pipeline parallelism. By eliminating pipeline bubbles, this method optimizes the use of GPUs in distributed training settings, ultimately reducing training times and costs. The research not only provides a theoretical basis for bubble-free pipeline scheduling but also demonstrates practical implementations that can be adopted in existing frameworks, as evidenced by their work with the Megatron-LM repository. Future developments may focus on integrating these techniques with other parallelism strategies, targeting more complex architectures and further improving training efficiency on a large scale.
