Breadth-First Pipeline Parallelism (2211.05953v2)

Published 11 Nov 2022 in cs.DC, cs.AI, cs.CL, and cs.LG

Abstract: We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce training time and cost by the same amount on a large GPU cluster.
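
The abstract does not spell out the schedule itself, so the following is only a rough illustration of the naming: in a looped pipeline where each GPU holds several (virtual) stages, a depth-first schedule pushes one micro-batch through all of a GPU's local stages before starting the next, whereas a breadth-first schedule runs every micro-batch through one local stage before advancing, which groups per-stage work and makes it easier to overlap the weight gathers and gradient reductions required by fully sharded data parallelism. The function names, parameters, and stage layout below are assumptions for illustration, not the paper's implementation; real schedules also interleave forward and backward passes (e.g., 1F1B) and coordinate across pipeline ranks, none of which is modeled here.

```python
# Minimal sketch (not the paper's implementation): contrast a depth-first and a
# breadth-first ordering of micro-batch forward passes on a single
# pipeline-parallel GPU that holds several local (virtual) stages.
# All names and parameters here are illustrative assumptions.

def depth_first_order(num_micro_batches: int, local_stages: int):
    """Run each micro-batch through all of this GPU's local stages
    before starting the next micro-batch."""
    order = []
    for mb in range(num_micro_batches):
        for stage in range(local_stages):
            order.append((mb, stage))
    return order


def breadth_first_order(num_micro_batches: int, local_stages: int):
    """Run every micro-batch through one local stage before moving to the
    next stage, keeping all micro-batches of the step grouped per stage.
    Grouping work by stage is what would allow per-stage sharded-weight
    gathers and gradient reductions to be overlapped with compute."""
    order = []
    for stage in range(local_stages):
        for mb in range(num_micro_batches):
            order.append((mb, stage))
    return order


if __name__ == "__main__":
    # Example: 4 micro-batches, 2 local (virtual) pipeline stages per GPU.
    print("depth-first  :", depth_first_order(4, 2))
    print("breadth-first:", breadth_first_order(4, 2))
```

Running the example prints the two (micro-batch, local-stage) orderings side by side, making the naming concrete: the breadth-first ordering finishes all micro-batches on one stage before touching the next, which is the property the paper's title refers to.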
