Decentralized Training of Foundation Models in Heterogeneous Environments (2206.01288v4)
Abstract: Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8X faster than prior state-of-the-art training systems (Megatron).
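To make the abstract's key idea concrete, below is a minimal Python sketch of an evolutionary search over tasklet-to-device allocations scored by a communication cost model on a heterogeneous network. The toy cost model (pipeline-stage traffic divided by pairwise bandwidth), the mutation scheme, and all function names are illustrative assumptions, not the paper's actual formulation, which also accounts for compute, data-parallel traffic, and memory constraints.

```python
# Hedged sketch: evolutionary search for assigning training "tasklets" (here,
# pipeline stages) to decentralized devices connected by a slow, heterogeneous
# network. The cost model and operators below are simplifying assumptions.
import random

def make_bandwidth_matrix(n_devices, seed=0):
    """Symmetric pairwise bandwidth (GB/s) between devices; deliberately uneven."""
    rng = random.Random(seed)
    bw = [[0.0] * n_devices for _ in range(n_devices)]
    for i in range(n_devices):
        for j in range(i + 1, n_devices):
            bw[i][j] = bw[j][i] = rng.uniform(0.05, 10.0)  # WAN-slow to LAN-fast
    return bw

def comm_cost(assignment, bw, stage_traffic_gb=1.0):
    """Toy cost: time to ship activations between consecutive pipeline stages.
    assignment[s] = device hosting tasklet (stage) s."""
    total = 0.0
    for s in range(len(assignment) - 1):
        a, b = assignment[s], assignment[s + 1]
        if a != b:
            total += stage_traffic_gb / bw[a][b]
    return total

def mutate(assignment, n_devices, rng):
    """Either swap two stages or move one stage to a random device."""
    child = list(assignment)
    if rng.random() < 0.5:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    else:
        child[rng.randrange(len(child))] = rng.randrange(n_devices)
    return child

def evolve(n_stages, n_devices, generations=200, pop_size=32, seed=0):
    rng = random.Random(seed)
    bw = make_bandwidth_matrix(n_devices, seed)
    pop = [[rng.randrange(n_devices) for _ in range(n_stages)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: comm_cost(a, bw))
        survivors = pop[: pop_size // 4]  # keep the cheapest allocations
        pop = survivors + [mutate(rng.choice(survivors), n_devices, rng)
                           for _ in range(pop_size - len(survivors))]
    best = min(pop, key=lambda a: comm_cost(a, bw))
    return best, comm_cost(best, bw)

if __name__ == "__main__":
    allocation, cost = evolve(n_stages=8, n_devices=8)
    print("stage -> device:", allocation, f"(estimated comm time {cost:.2f}s)")
```

In the paper's setting, the fitness function would instead come from its formal cost model over measured heterogeneous links, covering both data-parallel and pipeline-parallel communication rather than this single-path stage traffic.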
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
- FairScale: A general purpose modular PyTorch library for high performance and large scale training, 2021.
- GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663, 2021.
- Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. arXiv preprint arXiv:2201.12023, 2022.
- TeraPipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
- Q4’21 sees a nominal rise in GPU and PC shipments quarter-to-quarter. https://www.jonpeddie.com/press-releases/q421-sees-a-nominal-rise-in-gpu-and-pc-shipments-quarter-to-quarter.
- Screen savers of the world unite! Science, 290(5498):1903–1904, 2000.
- OS Statistics. https://stats.foldingathome.org/os, 2022. [Online; accessed 15-May-2022].
- GPU economics cost analysis. https://venturebeat.com/2018/02/25/the-real-cost-of-mining-ethereum/.
- Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, 30, 2017.
- Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pages 3478–3487. PMLR, 2019.
- A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.
- Towards crowdsourced training of large neural networks using decentralized mixture-of-experts. Advances in Neural Information Processing Systems, 33:3659–3672, 2020.
- Distributed deep learning in open collaborations. Advances in Neural Information Processing Systems, 34:7879–7897, 2021.
- GLaM: Efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905, 2021.
- PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
- Efficient algorithms for device placement of DNN graph operators. Advances in Neural Information Processing Systems, 33:15451–15463, 2020.
- Piper: Multidimensional planner for DNN parallelization. Advances in Neural Information Processing Systems, 34, 2021.
- DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
- HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 307–321, 2020.
- Christos H Papadimitriou. The Euclidean travelling salesman problem is NP-complete. Theoretical Computer Science, 4(3):237–244, 1977.
- A hybrid genetic algorithm for multiway graph partitioning. In Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation, pages 159–166. Citeseer, 2000.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.
- A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 463–479, 2020.
- Michel X Goemans. Lecture notes on bipartite matching. Massachusetts Institute of Technology, 2009.
- Computers and Intractability, volume 174. Freeman, San Francisco, 1979.
- Balanced graph partitioning. Theory of Computing Systems, 39(6):929–939, 2006.
- Think locally, act globally: Highly balanced graph partitioning. In International Symposium on Experimental Algorithms, pages 164–175. Springer, 2013.
- Recent advances in graph partitioning. Algorithm engineering, pages 117–158, 2016.
- Hybrid genetic algorithms: A review. Eng. Lett., 13(2):124–137, 2006.
- Genetic algorithm and graph partitioning. IEEE Transactions on Computers, 45(7):841–855, 1996.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8056–8067, 2018.
- Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, pages 3043–3052. PMLR, 2018.
- Communication compression for decentralized training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7663–7673, 2018.
- D2: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4848–4856. PMLR, 2018.
- SWARM parallelism: Training large models can be surprisingly communication-efficient. 2023.
- Varuna: Scalable, low-cost training of massive deep learning models. In Proceedings of the Seventeenth European Conference on Computer Systems, pages 472–487, 2022.
- David P Anderson. BOINC: A system for public-resource computing and storage. In Fifth IEEE/ACM International Workshop on Grid Computing, pages 4–10. IEEE, 2004.
- Reinforcement learning in dynamic task scheduling: A review. SN Computer Science, 1(6):1–17, 2020.
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
- GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019.
- PyTorch distributed: Experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12):3005–3018, 2020.
- Collective communication: Theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.
- NCCL. https://developer.nvidia.com/nccl.
- Memory-efficient pipeline-parallel DNN training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021.
- strongSwan VPN. https://www.strongswan.org/.
- Fluidstack. https://www.fluidstack.io/.