Memory-Efficient Pipeline-Parallel DNN Training

Published 16 Jun 2020 in cs.LG, cs.DC, and stat.ML | (2006.09503v3)

Abstract: Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT LLMs by up to 20× with similar final model accuracy.


Authors (5)

Citations (177)


Summary

The paper introduces PipeDream-2BW, a pipeline-parallel training system built on double-buffered weight updates (2BW), which keeps memory usage low and avoids pipeline flushes to raise training throughput.
It employs automatic model partitioning and workload balancing to optimize resource utilization across hardware with memory constraints.
Empirical results on GPT and BERT models with up to 3.9 billion parameters show that PipeDream-2BW maintains convergence quality, and the system can accommodate models with up to 30 billion parameters.

Overview of Memory-Efficient Pipeline-Parallel DNN Training

The paper introduces PipeDream-2BW, a system for efficiently training deep neural networks (DNNs) whose parameters and activations do not fit in the memory of a single accelerator. As models such as GPT and BERT continue to grow, training must be distributed over multiple devices. The authors propose a form of pipeline parallelism that addresses two limitations of existing model-parallel techniques: inefficient resource utilization and communication overhead.

PipeDream-2BW is built around a technique called "Double-Buffered Weight Updates" (2BW), which eases the trade-off between throughput and memory footprint found in traditional model parallelism. It combines a pipelining schedule with weight gradient coalescing and keeps two versions of the weights. The authors report training throughput up to 20 times higher than non-pipelining baselines, with similar final model accuracy.
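The double-buffering idea can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single scalar weight, a quadratic loss, and a hypothetical `DoubleBufferedWeights` class. It only mimics the update pattern in which gradients are computed on the older of two weight versions, coalesced over a batch of microbatches, and then applied to the newer version, i.e. W(t+1) = W(t) - lr * grad(W(t-1)).

```python
from dataclasses import dataclass


@dataclass
class DoubleBufferedWeights:
    """Toy scalar 'model' that keeps exactly two weight versions (hypothetical)."""
    current: float   # newest version, the one the next update is applied to
    previous: float  # older version, the one in-flight gradients were computed on
    lr: float = 0.1
    _grad_sum: float = 0.0
    _count: int = 0

    def accumulate(self, grad: float) -> None:
        # Gradient coalescing: sum microbatch gradients instead of applying each one.
        self._grad_sum += grad
        self._count += 1

    def commit(self) -> None:
        # Apply the averaged gradient to the newest version, then rotate buffers
        # so that only two versions are ever alive.
        avg_grad = self._grad_sum / self._count
        new_version = self.current - self.lr * avg_grad
        self.previous, self.current = self.current, new_version
        self._grad_sum, self._count = 0.0, 0


def train(batches, w0=0.0, lr=0.1):
    """Run the one-step-stale update on loss 0.5 * (w - target)**2 per microbatch."""
    weights = DoubleBufferedWeights(current=w0, previous=w0, lr=lr)
    history = [w0]
    for microbatch_targets in batches:
        stale = weights.previous  # gradients see the older weight version
        for target in microbatch_targets:
            weights.accumulate(stale - target)  # d/dw of 0.5 * (w - target)**2
        weights.commit()
        history.append(weights.current)
    return history


if __name__ == "__main__":
    trajectory = train([[1.0] * 4 for _ in range(50)])
    print(trajectory[:3], trajectory[-1])
```

In this toy setting the stale-gradient trajectory still converges to the target, which matches the intuition that a delay of one version changes the update semantics only mildly. A real pipeline would hold one such pair of buffers per stage.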

Key Contributions

The paper makes several technical contributions:

Double-Buffered Weight Updates (2BW): The approach keeps two versions of the weights and coalesces weight gradients before producing a new version. This keeps the memory footprint low and throughput high, and it avoids the expensive pipeline flushes required by existing methods such as GPipe. The cost is that updates are applied with a small delay rather than fully synchronously.
Automatic Model Partitioning: PipeDream-2BW automatically partitions DNN models across the available hardware, taking accelerator memory capacity and interconnect topology into account. This avoids the bottlenecks that arise when model parallelism is applied naively, without regard to hardware constraints.
Pipelining without Flushes: The system achieves low memory overhead and high throughput because it eliminates the frequent pipeline flushes that conventional methods need in order to keep weight versions consistent.
PipeDream-2BW Planner: The system includes a planning module that chooses a parallelization scheme, balancing work across stages by exploiting the model's repetitive structure, such as the transformer layers in BERT.
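The planning step can be sketched for a model made of identical repeated layers. The code below is an illustration only, under stated assumptions: the function name `plan`, its parameters, the memory model (two weight versions per stage plus stashed activations for up to `depth` in-flight microbatches), and the rule of preferring the shallowest pipeline that fits are all hypothetical stand-ins for the paper's cost model, which is not reproduced here.

```python
from math import ceil


def plan(num_layers, num_gpus, layer_weight_gb, layer_activation_gb, gpu_memory_gb):
    """Pick a (pipeline_depth, data_parallel_width, layers_per_stage) that fits in memory.

    Assumptions (illustrative, not from the paper):
      * all layers are identical, so stages are balanced by layer count;
      * each stage stores two weight versions (double buffering);
      * each stage stashes activations for up to `depth` in-flight microbatches;
      * the shallowest feasible pipeline is preferred.
    Returns None when no candidate configuration fits.
    """
    for depth in range(1, num_gpus + 1):
        # Only consider depths that tile the GPUs evenly and do not exceed the layer count.
        if num_gpus % depth != 0 or depth > num_layers:
            continue
        layers_per_stage = ceil(num_layers / depth)
        weight_memory = 2 * layers_per_stage * layer_weight_gb
        activation_memory = depth * layers_per_stage * layer_activation_gb
        if weight_memory + activation_memory <= gpu_memory_gb:
            return depth, num_gpus // depth, layers_per_stage
    return None


if __name__ == "__main__":
    # 24 repeated layers on 8 GPUs with 16 GB each (made-up sizes).
    print(plan(24, 8, layer_weight_gb=1.0, layer_activation_gb=0.25, gpu_memory_gb=16.0))
```

With these made-up sizes, only the depth-8 configuration fits in memory, so the sketch returns a pipeline of 8 stages with 3 layers each and no data-parallel replication. A more realistic planner would also weigh communication cost and microbatch size.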

Findings and Experimental Results

The authors evaluated PipeDream-2BW on GPT and BERT models with up to 3.9 billion parameters. The experimental results highlight the following:

Throughput Improvements: Compared to non-pipelining baselines, PipeDream-2BW trained up to 20 times faster on the largest model configurations. It also outperformed GPipe by up to 3.2 times, owing to the elimination of pipeline flushes and more efficient memory use.
Statistical Efficiency: Although 2BW changes the weight update semantics by introducing a delay term, models trained with it converge to a quality comparable to models trained with standard data parallelism and other pipelining methods.
Scalability: The system can fit models with up to 30 billion parameters on conventional hardware setups, which suggests it could remain viable for training larger future models.
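To see why removing pipeline flushes matters for the throughput comparison above, it helps to estimate how much time a flush-based schedule spends idle. The sketch below uses the standard idealized estimate for a flushing pipeline, assuming every stage takes equal time per microbatch and communication is free; the function name is hypothetical, and the numbers it produces are not measurements from the paper.

```python
def flush_idle_fraction(depth: int, microbatches: int) -> float:
    """Idealized fraction of time each stage idles in a pipeline that flushes every batch.

    With `depth` stages and `microbatches` microbatches per batch, filling and
    draining the pipeline costs (depth - 1) idle slots out of
    (microbatches + depth - 1) total slots per stage.
    """
    if depth < 1 or microbatches < 1:
        raise ValueError("depth and microbatches must be positive")
    return (depth - 1) / (microbatches + depth - 1)


if __name__ == "__main__":
    for m in (4, 16, 64):
        print(f"depth=8, microbatches={m}: idle fraction {flush_idle_fraction(8, m):.2f}")
```

Under this model, more microbatches per batch shrink the idle fraction but also increase the activation memory a flushing schedule must hold, which is the tension a flush-free schedule avoids.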

Implications and Future Developments

PipeDream-2BW improves the scalability of neural network training, particularly for very large models. For both academia and industry, the practical benefits are better resource utilization and lower training time and compute cost. The work also opens avenues for future research, including planning algorithms with more nuanced cost models and extensions of the double-buffering strategy to even larger models.

In conclusion, PipeDream-2BW provides a practical framework for training large-scale DNNs efficiently under accelerator memory constraints. As models continue to scale, methods of this kind are likely to remain important for managing computational overhead in memory-constrained environments.
