Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Published 11 Jun 2017 in cs.LG, cs.CV, cs.DC, and stat.ML | (1706.03292v1)

Abstract: Deep learning models can take weeks to train on a single GPU-equipped machine, necessitating scaling out DL training to a GPU-cluster. However, current distributed DL implementations can scale poorly due to substantial parameter synchronization over the network, because the high throughput of GPUs allows more data batches to be processed per unit time than CPUs, leading to more frequent network synchronization. We present Poseidon, an efficient communication architecture for distributed DL on GPUs. Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication. Moreover, Poseidon uses a hybrid communication scheme that optimizes the number of bytes required to synchronize each layer, according to layer properties and the number of machines. We show that Poseidon is applicable to different DL frameworks by plugging Poseidon into Caffe and TensorFlow. We show that Poseidon enables Caffe and TensorFlow to achieve 15.5x speed-up on 16 single-GPU machines, even with limited bandwidth (10GbE) and the challenging VGG19-22K network for image classification. Moreover, Poseidon-enabled TensorFlow achieves 31.5x speed-up with 32 single-GPU machines on Inception-V3, a 50% improvement over the open-source TensorFlow (20x speed-up).

Authors (10)

Citations (334)

Summary

The paper introduces Poseidon, a communication architecture that minimizes synchronization overhead in distributed deep learning through wait-free backpropagation and hybrid communication.
It achieves near-linear speedups with throughput improvements up to 31.5x on 32 GPU nodes, demonstrating scalable performance and efficient bandwidth utilization.
Experimental results on models like GoogLeNet, VGG19, and Inception-V3 show that Poseidon balances high throughput with reliable statistical convergence compared to other approaches.

The paper "Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters" addresses the poor scaling that distributed deep learning (DL) training often shows on GPU clusters. The authors present Poseidon, a communication architecture designed to reduce the cost of synchronizing model parameters across machines.

Core Contributions and Methodology

The paper targets the communication overhead of distributed DL implementations. Because GPUs process more data batches per unit time than CPUs, parameters must be synchronized over the network more frequently, and this traffic can limit scaling. Poseidon reduces the overhead with two strategies: wait-free backpropagation (WFBP) and hybrid communication (HybComm).

Wait-Free Backpropagation (WFBP): WFBP overlaps communication with computation, reducing the idle time that arises when a training iteration first computes all gradients and only then synchronizes them. It exploits the independence between one layer's parameter synchronization and the backpropagation of the remaining layers: once a layer's gradients have been computed, they can be sent while backpropagation continues. This pipelining matters most for networks whose fully-connected (FC) layers hold a large share of the parameters and therefore dominate synchronization traffic.
Hybrid Communication (HybComm): For each layer, HybComm selects between parameter server (PS) based communication and sufficient factor broadcasting (SFB), choosing whichever requires fewer bytes given the layer's properties and the cluster configuration (for example, the number of machines). Because the choice is made per layer, communication overhead is reduced without compromising the model's computational efficiency.
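The overlap that WFBP relies on can be illustrated with a toy timing model. This is a minimal sketch under stated assumptions, not the paper's implementation: layers are listed in backward-pass order, each layer has a hypothetical compute time and communication time, and all transfers share a single network link that handles one layer at a time. The function names and the example numbers are invented for illustration.

```python
def sequential_time(compute, comm):
    """Time for one backward pass when all gradients are computed first
    and synchronization starts only afterwards."""
    return sum(compute) + sum(comm)


def wait_free_time(compute, comm):
    """Time for one backward pass when each layer's synchronization starts
    as soon as its gradient is ready and the (single, assumed) link is free.

    Both lists are in backward-pass order: top layer first.
    """
    compute_done = 0.0  # time at which the current layer's gradient is ready
    link_free = 0.0     # time at which the network link becomes idle
    for layer_compute, layer_comm in zip(compute, comm):
        compute_done += layer_compute
        start = max(compute_done, link_free)
        link_free = start + layer_comm
    return max(compute_done, link_free)


# Hypothetical network: two FC layers on top (cheap to compute, expensive
# to synchronize) followed by two convolutional layers (the opposite).
compute_times = [1.0, 1.0, 4.0, 4.0]
comm_times = [4.0, 3.0, 0.5, 0.5]

baseline = sequential_time(compute_times, comm_times)
overlapped = wait_free_time(compute_times, comm_times)
print(f"sequential: {baseline}, wait-free: {overlapped}")
```

In this toy setting most of the FC-layer traffic is hidden behind the convolutional layers' backward computation, which is the effect the description above attributes to WFBP. Real systems have additional costs (the forward pass, server-side aggregation, contention) that this model ignores.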

Experimental Evaluation

Poseidon demonstrates its robustness and scalability through extensive experiments on varied DL architectures, including GoogLeNet, VGG19, and Inception-V3, using both Caffe and TensorFlow frameworks. Key findings include:

Scalability: Poseidon delivers near-linear speedups on up to 32 GPU nodes across the evaluated networks and both frameworks. The largest reported throughput improvement is 31.5x, for Inception-V3 with TensorFlow on 32 single-GPU machines, compared with a 20x speed-up for open-source TensorFlow.
Bandwidth Utilization: Through HybComm, Poseidon improves throughput even when network bandwidth is limited. For instance, on a 10GbE network, Poseidon enables Caffe and TensorFlow to reach a 15.5x speed-up on 16 single-GPU machines when training the VGG19-22K network, a model that would otherwise demand much higher bandwidth.
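The bandwidth savings attributed to HybComm can be illustrated with a simplified per-worker cost count. The sketch below is an assumption-laden illustration rather than the paper's exact decision rule: it counts matrix entries instead of bytes, charges PS one push and one pull of the full M x N gradient matrix, and charges SFB one pair of vectors (lengths M and N) per sample, sent to and received from every other worker. The function names, the `is_fc` flag, and the layer sizes are hypothetical.

```python
def ps_cost(m, n):
    """Entries moved per worker with a parameter server:
    push the full M x N gradient, pull the full M x N parameters."""
    return 2 * m * n


def sfb_cost(m, n, batch_size, workers):
    """Entries moved per worker with sufficient factor broadcasting:
    for each of batch_size samples, send vectors of length M and N to
    every other worker, and receive the same amount from each of them."""
    return 2 * batch_size * (workers - 1) * (m + n)


def choose_scheme(m, n, batch_size, workers, is_fc):
    """Pick the cheaper scheme for one layer under this simplified model.
    SFB is only considered for FC layers, whose gradients can be written
    as a sum of outer products of per-sample vectors."""
    if is_fc and sfb_cost(m, n, batch_size, workers) <= ps_cost(m, n):
        return "SFB"
    return "PS"


# Hypothetical 4096 x 1000 FC layer, batch size 32 per worker.
for workers in (8, 32):
    scheme = choose_scheme(4096, 1000, batch_size=32, workers=workers, is_fc=True)
    print(workers, ps_cost(4096, 1000), sfb_cost(4096, 1000, 32, workers), scheme)
```

Under these assumptions the same layer favors SFB on a small cluster and PS on a larger one, because the SFB cost grows with the number of workers while the per-worker PS cost does not. This is why the choice depends on both layer properties and the number of machines.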

Comparative Analysis

Poseidon's utility extends beyond just performance improvements. When juxtaposed against other prevalent techniques, such as Microsoft's Adam architecture and CNTK's 1-bit quantization, Poseidon distinctively balances system throughput with statistical convergence. While Adam suffers from communication load imbalances and CNTK potentially compromises accuracy due to quantization, Poseidon offers a coherent solution that maintains statistical efficiency without sacrificing computational throughput.
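To make the accuracy concern about quantization concrete, the following is a generic sketch of 1-bit gradient quantization with error feedback. It is not CNTK's implementation: the function name, the choice of reconstruction values (the mean of the non-negative and of the negative entries), and the example gradient are assumptions made for illustration.

```python
def quantize_1bit(grad, residual):
    """Quantize a gradient vector to one bit per entry.

    The residual (quantization error carried over from the previous step)
    is added before quantizing, and the new error is returned so it can be
    fed back into the next step.
    Returns (bits, pos_value, neg_value, new_residual).
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    bits = [1 if x >= 0 else 0 for x in corrected]

    pos = [x for x in corrected if x >= 0]
    neg = [x for x in corrected if x < 0]
    # One reconstruction value per sign, so the receiver needs only the
    # bit vector plus two floats.
    pos_value = sum(pos) / len(pos) if pos else 0.0
    neg_value = sum(neg) / len(neg) if neg else 0.0

    decoded = [pos_value if b else neg_value for b in bits]
    new_residual = [x - d for x, d in zip(corrected, decoded)]
    return bits, pos_value, neg_value, new_residual


grad = [0.5, -0.25, 1.5, -0.75]
bits, pos_value, neg_value, residual = quantize_1bit(grad, [0.0] * len(grad))
print(bits, pos_value, neg_value, residual)
```

The residual shows how much of each gradient entry is not transmitted in the current step. Schemes of this kind trade that per-step error for a large reduction in traffic, whereas the approach described above sends exact gradients (or factors that reconstruct them exactly).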

Implications and Future Trajectories

The implications of this research are significant for both theoretical advancements and practical implementations in parallelized DL frameworks. The adaptability of Poseidon to multiple DL environments suggests potential for integration into existing systems to maximize GPU utilization and minimize training times. As DL models become increasingly complex and data-hungry, architectures like Poseidon offer a scalable path forward, mitigating the synchronization bottlenecks that hinder distributed machine learning's broader adoption.

Looking ahead, Poseidon lays a foundation for adaptive communication strategies that handle finer-grained inter-layer dependencies or support asynchronous training, which would broaden its applicability. As DL spreads to new domains, the strategies described in the paper are likely to inform future work on closing the gap between algorithmic advances and computational feasibility.

In conclusion, the paper succeeds in presenting Poseidon as a viable and efficient solution for enhancing distributed DL on GPU clusters, providing a nuanced understanding of communication overheads and a practical framework for their amelioration.
