Reducing Activation Recomputation in Large Transformer Models (2205.05198v1)

Published 10 May 2022 in cs.LG and cs.CL

Abstract: Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on LLMs up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.

Citations (196)

Summary

  • The paper introduces novel techniques combining sequence parallelism and selective activation recomputation for efficient training of large transformers.
  • It demonstrates a 5x reduction in activation memory and an over 90% reduction in the execution-time overhead of activation recomputation for models up to 1 trillion parameters.
  • The methods boost model FLOPs utilization and enable training on existing GPU clusters, paving the way for scalable deep learning.

Reducing Activation Recomputation in Large Transformer Models

The paper "Reducing Activation Recomputation in Large Transformer Models" addresses one of the pressing computational challenges in modern AI: the efficient training of large transformer models. The authors introduce novel techniques aimed at reducing the memory overhead associated with activation recomputation, a common strategy employed to circumvent the memory limitations encountered when training models with hundreds of billions to trillions of parameters.

Key Techniques and Contributions

The authors propose two main techniques, sequence parallelism and selective activation recomputation, designed to work in conjunction with tensor parallelism. Sequence parallelism partitions activations along the sequence dimension across the tensor-parallel GPUs, and selective recomputation re-executes only the cheapest-to-recompute operations, so the reliance on full activation recomputation largely disappears. Together, the paper shows, these methods reduce activation memory by a factor of five and cut the execution-time overhead of activation recomputation by over 90%.
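
As a rough sketch of the accounting behind these figures (notation following the paper: sequence length s, microbatch size b, hidden size h, number of attention heads a, tensor-parallel size t; 16-bit activations assumed), the per-layer activation footprint shrinks approximately as follows:

```latex
\begin{align*}
\text{no parallelism:}\quad & sbh\left(34 + 5\,\tfrac{as}{h}\right)\\
\text{tensor parallelism:}\quad & sbh\left(10 + \tfrac{24}{t} + 5\,\tfrac{as}{ht}\right)\\
\text{+ sequence parallelism:}\quad & sbh\left(\tfrac{34}{t} + 5\,\tfrac{as}{ht}\right)\\
\text{+ selective recomputation:}\quad & \tfrac{34\,sbh}{t}
\end{align*}
```

Sequence parallelism partitions the terms that tensor parallelism leaves replicated (the LayerNorm and dropout activations), and selective recomputation removes the 5as/(ht) attention-score term, which dominates at long sequence lengths.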

  1. Sequence Parallelism: Standard tensor parallelism leaves the LayerNorm and dropout regions of each transformer layer unpartitioned, so their activations are replicated on every tensor-parallel GPU. Sequence parallelism instead splits these activations along the sequence dimension, so each GPU stores only its own sequence shard and no redundant copies are kept.
  2. Selective Activation Recomputation: Instead of recomputing the full transformer layer, the authors recompute only the activations that are cheapest to regenerate relative to the memory they occupy: the core attention operations (the QK^T product, softmax, softmax dropout, and attention over V), which produce large intermediate tensors but require comparatively few FLOPs. A minimal sketch of this idea follows the list.
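
A minimal PyTorch sketch of the selective-recomputation idea, for illustration only (this is not the Megatron-LM implementation; shapes, the dropout rate, and function names here are arbitrary):

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def core_attention(q, k, v, dropout_p: float = 0.1):
    # The memory-heavy part of self-attention: QK^T, softmax, dropout, and the
    # weighted sum over V. Its intermediates grow with sequence length squared,
    # but the FLOPs per stored byte are low, so recomputing it is cheap.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    probs = F.dropout(torch.softmax(scores, dim=-1), p=dropout_p, training=True)
    return probs @ v

def attention_with_selective_recompute(q, k, v):
    # Checkpoint only the core attention: its large intermediate tensors are
    # freed after the forward pass and recomputed from q, k, v during backward.
    # The Q/K/V and output projections (not shown) keep their activations.
    return checkpoint(core_attention, q, k, v, use_reentrant=False)

# Usage sketch with (batch, heads, sequence, head_dim) tensors.
q, k, v = (torch.randn(2, 16, 1024, 64, requires_grad=True) for _ in range(3))
attention_with_selective_recompute(q, k, v).sum().backward()
```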

These techniques were evaluated on models ranging up to one trillion parameters. For a 530B-parameter GPT-3-style model trained on 2240 NVIDIA A100 GPUs, the authors report a model FLOPs utilization of 54.2%, a 29% improvement over the 42.1% achieved with full activation recomputation. This is a notable result, as it translates directly into shorter wall-clock training time for extremely large LLMs.
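
Model FLOPs utilization counts only the FLOPs the model itself requires, not recomputed ones, against what the hardware could deliver in the same wall-clock time, so wasted recomputation shows up directly as lower utilization. Schematically (a standard formulation, not a quote from the paper):

```latex
\text{MFU} \;=\; \frac{\text{model FLOPs per iteration}}
                      {\text{iteration time} \times N_{\text{GPUs}} \times \text{peak FLOP/s per GPU}},
\qquad
\frac{54.2\%}{42.1\%} \;\approx\; 1.29 .
```

The ratio of the two utilization figures is where the 29% speedup comes from.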

Practical and Theoretical Implications

Practically, the integration of sequence parallelism and selective activation recomputation into frameworks such as Megatron-LM and NeMo-Megatron enhances the feasibility of training large models on existing hardware infrastructures, such as clusters of NVIDIA A100 GPUs. Models at these scales can be kept within GPU memory limits without paying most of the compute cost of full recomputation.
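
Concretely, the main change sequence parallelism introduces inside these frameworks is to the tensor-parallel communication pattern: the all-reduce that normally surrounds each attention/MLP block is split into an all-gather on entry and a reduce-scatter on exit, so the LayerNorm and dropout in between can hold activations split along the sequence dimension. A schematic sketch of the two forward-pass collectives, assuming an already-initialized torch.distributed process group and omitting the autograd plumbing (the paper's g and g-bar operators pair each of these with the opposite collective in the backward pass):

```python
import torch
import torch.distributed as dist

def gather_along_sequence(x: torch.Tensor, group=None) -> torch.Tensor:
    # Forward of the "g" operator: all-gather sequence shards so the following
    # column-parallel linear (QKV or first MLP layer) sees the full sequence.
    # The matching backward would be a reduce-scatter (omitted here).
    world = dist.get_world_size(group=group)
    shards = [torch.empty_like(x) for _ in range(world)]
    dist.all_gather(shards, x.contiguous(), group=group)
    return torch.cat(shards, dim=0)  # dim 0 is the sequence dimension

def scatter_along_sequence(x: torch.Tensor, group=None) -> torch.Tensor:
    # Forward of the "g-bar" operator: reduce-scatter the row-parallel linear's
    # partial outputs so dropout/LayerNorm again hold only a sequence shard.
    # The matching backward would be an all-gather (omitted here). Assumes the
    # sequence length divides evenly across the group.
    world = dist.get_world_size(group=group)
    shards = list(x.contiguous().chunk(world, dim=0))
    out = torch.empty_like(shards[0])
    dist.reduce_scatter(out, shards, group=group)
    return out
```

An all-reduce carries the same communication volume as an all-gather plus a reduce-scatter, so this rearrangement saves memory without adding bandwidth cost, which is one reason the reported overhead stays low.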

Theoretically, the paper extends the understanding of model parallelism by introducing novel strategies that optimize memory usage without substantially increasing computational complexity. This work paves the way for further innovations in the efficient scaling of neural networks, fostering future research efforts that could refine activation management techniques or propose entirely new paradigms.

Future Directions

While the current implementation has proven effective, the paper suggests further research into optimizing memory fragmentation and balancing memory usage across pipeline stages. Exploration into other dimensions of parallelism or synergy with existing methods, such as data parallelism and memory offloading strategies, could yield even greater efficiencies.

In summary, this research advances the field of deep learning by addressing core bottlenecks in training infrastructure, enabling the continued growth of model sizes while maintaining feasible training times and resources. This work is instrumental in allowing practitioners to push the boundaries of transformer model capabilities and explore more complex tasks with larger datasets.
