Mesa: A Memory-saving Training Framework for Transformers

Published 22 Nov 2021 in cs.CV and cs.LG | (2111.11124v3)

Abstract: There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory intensive owing to storing all intermediate activations that are needed for gradient computation during backpropagation, especially for long sequences. To this end, we present Mesa, a memory-saving training framework for Transformers. Specifically, Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training. The low-precision activations are then dequantized during back-propagation to compute gradients. Besides, to address the heterogeneous activation distributions in the multi-head self-attention layers, we propose a head-wise activation quantization strategy, which quantizes activations based on the statistics of each head to minimize the approximation error. To further boost training efficiency, we learn quantization parameters by running estimates. More importantly, by re-investing the saved memory in employing a larger batch size or scaling up model size, we may further improve the performance under constrained computational resources. Extensive experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can achieve flexible memory-savings (up to 50%) during training while achieving comparable or even better performance. Code is available at https://github.com/ziplab/Mesa.


Summary

The paper demonstrates that Mesa cuts training memory usage by up to 50% by storing low-precision activations for the backward pass and quantizing self-attention activations head by head.
The methodology efficiently dequantizes stored activations during backpropagation, maintaining gradient fidelity with minimal overhead.
Mesa’s plug-and-play design makes it compatible with a range of Transformer and vision models, and running estimates keep the cost of setting quantization parameters low.

Overview of Mesa: A Memory-Saving Training Framework for Transformers

The paper "Mesa: A Memory-saving Training Framework for Transformers" addresses a significant challenge in training Transformers: the memory needed to store all intermediate activations for gradient computation during backpropagation. As Transformer models grow in size to achieve higher performance, and as sequences get longer, training becomes prohibitively memory-intensive, especially for users with limited computational resources. This research introduces Mesa, a framework that mitigates the problem by compressing the activations stored for backpropagation without compromising training efficacy.

Key Contributions

Memory Efficiency through Low-precision Activation Storage: Mesa uses exact activations during the forward pass but stores only a low-precision (quantized) copy of them for backpropagation. During the backward pass, the stored activations are dequantized to compute gradients. Because stored activations are a major source of training memory, this reduces memory consumption flexibly, by up to 50%.
Head-wise Activation Quantization: Due to the varied activation distributions in the multi-head self-attention (MSA) layers of transformers, a generic quantization approach can cause performance degradation. The paper proposes a head-wise quantization strategy, which tailors the quantization parameters to the statistics of each head, thereby minimizing approximation errors and maintaining gradient fidelity.
Learning Quantization Parameters with Running Estimates: Instead of relying on costly per-sample statistics or gradient-based methods to set the quantization parameters, Mesa updates them with running estimates. This keeps the additional memory and computation small, so training speed and model performance are largely preserved.
Practical Implementation and Usability: Mesa is implemented as a plug-and-play module, compatible with CUDA and adaptable to a wide range of transformer architectures. This implementation not only covers transformer-specific operations like MatMul, Softmax, GELU, and LayerNorm but also extends to convolutional layers for vision tasks.
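
The first three contributions can be illustrated with a minimal NumPy sketch. It is not Mesa's CUDA implementation or its API: the class `HeadwiseQuantizer`, the functions `scores_forward` and `scores_backward`, the 8-bit width, the min/max range tracking, and the momentum value are all assumptions made for illustration. The sketch computes attention scores from exact inputs, keeps only `uint8` codes of the inputs for the backward pass, and then compares head-wise ranges against a single shared range.

```python
import numpy as np


class HeadwiseQuantizer:
    """8-bit asymmetric uniform quantizer with one (min, max) range per head.

    Hypothetical helper, not Mesa's API. Ranges are running estimates
    (exponential moving averages), not per-sample statistics or learned values.
    """

    LEVELS = 255  # 8-bit codes 0..255, stored as uint8

    def __init__(self, num_heads, momentum=0.9):
        self.momentum = momentum
        self.lo = np.zeros(num_heads, dtype=np.float32)
        self.hi = np.zeros(num_heads, dtype=np.float32)
        self.initialized = False

    def update(self, x):
        # x: (batch, heads, tokens, dim); reduce over every axis except heads.
        lo, hi = x.min(axis=(0, 2, 3)), x.max(axis=(0, 2, 3))
        if not self.initialized:
            self.lo, self.hi, self.initialized = lo, hi, True
        else:
            m = self.momentum
            self.lo = m * self.lo + (1 - m) * lo
            self.hi = m * self.hi + (1 - m) * hi

    def _params(self):
        lo = self.lo[None, :, None, None]
        scale = (self.hi - self.lo)[None, :, None, None] / self.LEVELS
        return lo, np.maximum(scale, 1e-8)  # guard against a zero-width range

    def quantize(self, x):
        lo, scale = self._params()
        return np.clip(np.round((x - lo) / scale), 0, self.LEVELS).astype(np.uint8)

    def dequantize(self, codes):
        lo, scale = self._params()
        return codes.astype(np.float32) * scale + lo


def scores_forward(q, k, quant_q, quant_k):
    scores = q @ np.swapaxes(k, -1, -2)  # forward pass uses exact activations
    quant_q.update(q)
    quant_k.update(k)
    # Only the uint8 codes are kept for backward (1 byte vs 4 bytes per value).
    saved = (quant_q.quantize(q), quant_k.quantize(k))
    return scores, saved


def scores_backward(grad_scores, saved, quant_q, quant_k):
    # Dequantize the stored codes, then apply the usual matmul gradients.
    q_hat = quant_q.dequantize(saved[0])
    k_hat = quant_k.dequantize(saved[1])
    grad_q = grad_scores @ k_hat
    grad_k = np.swapaxes(grad_scores, -1, -2) @ q_hat
    return grad_q, grad_k


# Synthetic inputs whose three heads have very different value ranges.
rng = np.random.default_rng(0)
head_scales = np.array([0.1, 1.0, 10.0]).reshape(1, 3, 1, 1)
q = (rng.standard_normal((2, 3, 16, 8)) * head_scales).astype(np.float32)
k = (rng.standard_normal((2, 3, 16, 8)) * head_scales).astype(np.float32)
grad_scores = rng.standard_normal((2, 3, 16, 16)).astype(np.float32)

quant_q, quant_k = HeadwiseQuantizer(3), HeadwiseQuantizer(3)
scores, saved = scores_forward(q, k, quant_q, quant_k)
grad_q, grad_k = scores_backward(grad_scores, saved, quant_q, quant_k)

exact_grad_q = grad_scores @ k
rel_err = np.linalg.norm(grad_q - exact_grad_q) / np.linalg.norm(exact_grad_q)

# Baseline for comparison: one shared range applied to every head.
shared = HeadwiseQuantizer(3)
shared.lo = np.full(3, q.min(), dtype=np.float32)
shared.hi = np.full(3, q.max(), dtype=np.float32)
err_headwise = np.abs(quant_q.dequantize(saved[0]) - q).mean()
err_shared = np.abs(shared.dequantize(shared.quantize(q)) - q).mean()
print(f"relative gradient error: {rel_err:.4f}")
print(f"mean abs error, head-wise: {err_headwise:.5f}, shared: {err_shared:.5f}")
```

In this toy setting the shared range is dictated by the widest head, so the narrow heads are left with only a few usable quantization levels. That is the kind of approximation error the head-wise strategy is meant to avoid. Calling `update` on later batches blends new per-head ranges into the running estimates instead of recomputing statistics per sample.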

Experimental Validation

Experiments on ImageNet, CIFAR-100, and ADE20K show that Mesa can reduce training memory usage by up to about half while achieving comparable or even better performance than standard training. When combined with existing memory-saving techniques such as automatic mixed-precision (AMP) and checkpointing, Mesa further lowers the memory barrier to training large-scale models.

Implications and Future Directions

The implications of this research are twofold. Practically, Mesa opens up the possibility for a wider range of researchers to engage with large-scale transformer models, making high-performance training accessible despite hardware limitations. Theoretically, it extends the understanding of activation quantization in the context of neural networks, offering insights into designing more elaborate strategies that leverage heterogeneous activation distributions for efficiency gains.

Looking forward, the adoption of Mesa or similar memory-efficient techniques could significantly influence model design and deployment, especially in settings where compute and memory are constrained. Future work may explore extending these concepts to even lower quantization bit-widths and investigating the trade-offs between quantization granularity and model generalization. As this research illustrates, refining memory-efficient training algorithms is an important part of the path toward larger and more capable models.
