Decentralized Training of Foundation Models in Heterogeneous Environments (2206.01288v4)
Abstract: Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8X faster than prior state-of-the-art training systems (Megatron).
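To make the abstract's key idea concrete, below is a minimal Python sketch of an evolutionary search over tasklet-to-device allocations scored by a communication cost model on a heterogeneous network. The toy cost model (pipeline-stage traffic divided by pairwise bandwidth), the mutation scheme, and all function names are illustrative assumptions, not the paper's actual formulation, which also accounts for compute, data-parallel traffic, and memory constraints.

```python
# Hedged sketch: evolutionary search for assigning training "tasklets" (here,
# pipeline stages) to decentralized devices connected by a slow, heterogeneous
# network. The cost model and operators below are simplifying assumptions.
import random

def make_bandwidth_matrix(n_devices, seed=0):
    """Symmetric pairwise bandwidth (GB/s) between devices; deliberately uneven."""
    rng = random.Random(seed)
    bw = [[0.0] * n_devices for _ in range(n_devices)]
    for i in range(n_devices):
        for j in range(i + 1, n_devices):
            bw[i][j] = bw[j][i] = rng.uniform(0.05, 10.0)  # WAN-slow to LAN-fast
    return bw

def comm_cost(assignment, bw, stage_traffic_gb=1.0):
    """Toy cost: time to ship activations between consecutive pipeline stages.
    assignment[s] = device hosting tasklet (stage) s."""
    total = 0.0
    for s in range(len(assignment) - 1):
        a, b = assignment[s], assignment[s + 1]
        if a != b:
            total += stage_traffic_gb / bw[a][b]
    return total

def mutate(assignment, n_devices, rng):
    """Either swap two stages or move one stage to a random device."""
    child = list(assignment)
    if rng.random() < 0.5:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    else:
        child[rng.randrange(len(child))] = rng.randrange(n_devices)
    return child

def evolve(n_stages, n_devices, generations=200, pop_size=32, seed=0):
    rng = random.Random(seed)
    bw = make_bandwidth_matrix(n_devices, seed)
    pop = [[rng.randrange(n_devices) for _ in range(n_stages)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: comm_cost(a, bw))
        survivors = pop[: pop_size // 4]  # keep the cheapest allocations
        pop = survivors + [mutate(rng.choice(survivors), n_devices, rng)
                           for _ in range(pop_size - len(survivors))]
    best = min(pop, key=lambda a: comm_cost(a, bw))
    return best, comm_cost(best, bw)

if __name__ == "__main__":
    allocation, cost = evolve(n_stages=8, n_devices=8)
    print("stage -> device:", allocation, f"(estimated comm time {cost:.2f}s)")
```

In the paper's setting, the fitness function would instead come from its formal cost model over measured heterogeneous links, covering both data-parallel and pipeline-parallel communication rather than this single-path stage traffic.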
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
- FairScale: A general purpose modular PyTorch library for high performance and large scale training, 2021.
- GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663, 2021.
- Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. arXiv preprint arXiv:2201.12023, 2022.
- TeraPipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
- Q4’21 sees a nominal rise in GPU and PC shipments quarter-to-quarter. https://www.jonpeddie.com/press-releases/q421-sees-a-nominal-rise-in-gpu-and-pc-shipments-quarter-to-quarter.
- Screen savers of the world unite! Science, 290(5498):1903–1904, 2000.
- OS Statistics. https://stats.foldingathome.org/os, 2022. [Online; accessed 15-May-2022].
- GPU economics cost analysis. https://venturebeat.com/2018/02/25/the-real-cost-of-mining-ethereum/.
- Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, 30, 2017.
- Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pages 3478–3487. PMLR, 2019.
- A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.
- Towards crowdsourced training of large neural networks using decentralized mixture-of-experts. Advances in Neural Information Processing Systems, 33:3659–3672, 2020.
- Distributed deep learning in open collaborations. Advances in Neural Information Processing Systems, 34:7879–7897, 2021.
- GLaM: Efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905, 2021.
- PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
- Efficient algorithms for device placement of DNN graph operators. Advances in Neural Information Processing Systems, 33:15451–15463, 2020.
- Piper: Multidimensional planner for DNN parallelization. Advances in Neural Information Processing Systems, 34, 2021.
- DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
- HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 307–321, 2020.
- Christos H Papadimitriou. The Euclidean travelling salesman problem is NP-complete. Theoretical Computer Science, 4(3):237–244, 1977.
- A hybrid genetic algorithm for multiway graph partitioning. In Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation, pages 159–166. Citeseer, 2000.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.
- A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 463–479, 2020.
- Michel X Goemans. Lecture notes on bipartite matching. Massachusetts Institute of Technology, 2009.
- Computers and Intractability, volume 174. Freeman, San Francisco, 1979.
- Balanced graph partitioning. Theory of Computing Systems, 39(6):929–939, 2006.
- Think locally, act globally: Highly balanced graph partitioning. In International Symposium on Experimental Algorithms, pages 164–175. Springer, 2013.
- Recent advances in graph partitioning. Algorithm engineering, pages 117–158, 2016.
- Hybrid genetic algorithms: A review. Eng. Lett., 13(2):124–137, 2006.
- Genetic algorithm and graph partitioning. IEEE Transactions on Computers, 45(7):841–855, 1996.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8056–8067, 2018.
- Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pages 3043–3052. PMLR, 2018.
- Communication compression for decentralized training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7663–7673, 2018.
- D2: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4848–4856. PMLR, 2018.
- SWARM parallelism: Training large models can be surprisingly communication-efficient. 2023.
- Varuna: Scalable, low-cost training of massive deep learning models. In Proceedings of the Seventeenth European Conference on Computer Systems, pages 472–487, 2022.
- David P Anderson. BOINC: A system for public-resource computing and storage. In Fifth IEEE/ACM International Workshop on Grid Computing, pages 4–10. IEEE, 2004.
- Reinforcement learning in dynamic task scheduling: A review. SN Computer Science, 1(6):1–17, 2020.
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
- GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019.
- PyTorch distributed: Experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12):3005–3018, 2020.
- Collective communication: Theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.
- NCCL. https://developer.nvidia.com/nccl.
- Memory-efficient pipeline-parallel DNN training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021.
- strongSwan VPN. https://www.strongswan.org/.
- Fluidstack. https://www.fluidstack.io/.