Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Published 27 Sep 2021 in cs.LG, cs.AI, and cs.CL | (2109.12948v1)

Abstract: Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme -- per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at https://github.com/qualcomm-ai-research/transformer-quantization.




Summary

The paper identifies that high dynamic range activations and structured outliers in transformer residuals hinder effective low-bit quantization.
The paper proposes three alternative solutions (mixed-precision post-training quantization, per-embedding-group quantization, and quantization-aware training), each with a different trade-off between accuracy, model size, and ease of use.
The paper shows that transformer weights and embeddings can be quantized to ultra-low bit-widths, giving significant memory savings with minimal accuracy loss and making deployment on resource-constrained devices more practical.

Overview of Efficient Transformer Quantization Challenges and Solutions

The paper "Understanding and Overcoming the Challenges of Efficient Transformer Quantization" investigates why transformer-based models such as BERT, now the standard for many NLP tasks, are hard to quantize. The motivation is practical: the memory footprint and latency of these models are obstacles to efficient deployment, particularly on resource-limited devices.

Key Challenges in Transformer Quantization

The authors identify challenges specific to quantizing transformers, chiefly the high dynamic range of activations, which is difficult to represent in a low-bit fixed-point format. A notable finding is that the residual connections contain structured outliers that encourage specific attention patterns, such as attending to the special separator token. This behavior is most prevalent in the deeper encoder layers, so straightforward quantization leads to severe performance degradation.
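
The effect of a few high-range dimensions on a shared quantization grid can be illustrated with a small NumPy sketch. This is not the paper's code: the synthetic activations, the choice of outlier dimensions, the scaling factor of 30, and the helper names `fake_quantize` and `mse_on_normal_dims` are all assumptions made for illustration. It applies uniform min-max quantization to the whole tensor and measures the error on the dimensions that are not outliers.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Uniform asymmetric quantization over the tensor's min-max range,
    followed by dequantization back to floating point."""
    qmax = 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / qmax
    if scale == 0.0:  # constant tensor: nothing to quantize
        return x.copy()
    zero_point = round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
# Synthetic activations (assumption): 16 tokens x 768 embedding dimensions.
clean = rng.normal(0.0, 1.0, size=(16, 768))
outlier_dims = [100, 500]        # hypothetical outlier dimensions
spiky = clean.copy()
spiky[:, outlier_dims] *= 30.0   # inflate the range of two dimensions

normal_dims = np.setdiff1d(np.arange(768), outlier_dims)

def mse_on_normal_dims(x):
    """Quantization error measured only on the non-outlier dimensions."""
    err = x[:, normal_dims] - fake_quantize(x)[:, normal_dims]
    return float(np.mean(err ** 2))

mse_clean = mse_on_normal_dims(clean)
mse_spiky = mse_on_normal_dims(spiky)
print(f"MSE on ordinary dims, no outliers:   {mse_clean:.6f}")
print(f"MSE on ordinary dims, with outliers: {mse_spiky:.6f}")
```

Because the step size is set by the full range of the tensor, two inflated dimensions coarsen the grid for all the others, which is the kind of range mismatch described above.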

Contributions and Proposed Solutions

The paper introduces several technical solutions to mitigate these challenges. These include:

Post-Training Quantization (PTQ) Limitations: Initial experiments revealed that standard 8-bit PTQ leads to marked performance drops, which are most pronounced when activations are quantized. The authors trace this degradation to the residual sum after the FFN, where a large mismatch in dynamic range between the summed tensors amplifies quantization noise.
Three-Pronged Solution Approach:
- Mixed Precision PTQ: This technique selectively allocates higher precision to sensitive parts of the network, such as using 16-bit activations for problematic layers while maintaining others at 8-bit to achieve a balance between accuracy and efficiency.
- Per-Embedding-Group Quantization: A novel scheme that sets quantization parameters at the granularity of groups of embedding dimensions rather than once per tensor. The few embedding dimensions that generate outliers then no longer dictate the quantization range for all the others, which preserves accuracy without significant computational overhead.
- Quantization-Aware Training (QAT): By incorporating quantization steps into the training process, this method allows for adaptation to quantization noise, maintaining the accuracy of the models when deployed in low-bit configurations.
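
The first two options in the list above can be compared on synthetic data with a minimal sketch. It is not the authors' implementation: the activations, the group count of 6, the sort-by-range step used to gather high-range dimensions into one group, and the names `fake_quantize` and `per_embedding_group_quantize` are assumptions for illustration. The sketch contrasts 8-bit per-tensor quantization with 16-bit per-tensor quantization (the mixed-precision idea) and with 8-bit quantization that uses separate ranges per group of embedding dimensions.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Uniform asymmetric min-max quantization followed by dequantization."""
    qmax = 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / qmax
    if scale == 0.0:  # constant tensor: nothing to quantize
        return x.copy()
    zero_point = round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def per_embedding_group_quantize(x, num_groups=6, num_bits=8, sort_by_range=True):
    """Quantize x of shape (tokens, d) with one range per group of
    embedding dimensions instead of one range for the whole tensor."""
    d = x.shape[-1]
    if d % num_groups != 0:
        raise ValueError("embedding size must be divisible by num_groups")
    if sort_by_range:
        # Assumption of this sketch: order dimensions by their range so
        # that high-range dimensions end up in the same group.
        order = np.argsort(x.max(axis=0) - x.min(axis=0))
    else:
        order = np.arange(d)
    out = np.empty_like(x)
    for idx in np.split(order, num_groups):
        out[:, idx] = fake_quantize(x[:, idx], num_bits)
    return out

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
# Synthetic activations (assumption): 16 tokens x 768 dims, two outlier dims.
acts = rng.normal(0.0, 1.0, size=(16, 768))
acts[:, [100, 500]] *= 30.0

mse_per_tensor_8 = mse(acts, fake_quantize(acts, num_bits=8))
mse_per_tensor_16 = mse(acts, fake_quantize(acts, num_bits=16))
mse_peg_8 = mse(acts, per_embedding_group_quantize(acts, num_groups=6, num_bits=8))

print(f"per-tensor, 8-bit:          {mse_per_tensor_8:.6f}")
print(f"per-tensor, 16-bit:         {mse_per_tensor_16:.8f}")
print(f"per-embedding-group, 8-bit: {mse_peg_8:.6f}")
```

In this toy setting both alternatives reduce the reconstruction error relative to 8-bit per-tensor quantization; the higher bit-width does so at the cost of wider activations, while grouping keeps 8 bits and stores a few extra ranges.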

Experimental Validation

The authors evaluate these methods on BERT using the GLUE benchmark. The results establish state-of-the-art performance for post-training quantization and substantially reduce the accuracy loss usually associated with low-bit quantization. The paper also shows that weights and embeddings can be quantized to ultra-low bit-widths, giving significant memory savings with minimal loss in accuracy.
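
The scale of the memory savings mentioned above can be estimated with simple arithmetic. The sketch below assumes a model with roughly 110 million parameters (a commonly quoted size for BERT-base, not a figure taken from this page) and ignores the small overhead of storing quantization scales and zero-points; `model_size_mb` and `ASSUMED_PARAMS` are illustrative names.

```python
def model_size_mb(num_params, bits_per_param):
    """Storage needed for the parameters alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

# Assumption: approximate parameter count of a BERT-base-sized model.
ASSUMED_PARAMS = 110_000_000

for bits in (32, 8, 4, 2):
    size = model_size_mb(ASSUMED_PARAMS, bits)
    print(f"{bits:>2}-bit parameters: {size:7.1f} MB")
```

Under these assumptions, storage scales linearly with bit-width, so moving from 32-bit floats to 8-bit values cuts parameter storage by a factor of four, and lower bit-widths reduce it proportionally further.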

Implications and Future Directions

The implications of this work are both practical and conceptual. Practically, the demonstrated techniques make transformer models more viable for deployment in memory- and power-constrained environments, such as mobile devices and edge computing platforms. Conceptually, the findings show that a single quantization range per tensor can be a poor fit for transformer activations, and they motivate further work on fine-grained quantization methods adapted to the characteristics of a given architecture.

Future developments could explore adaptive quantization schemes, possibly leveraging dynamic quantization that adjusts bit-widths based on real-time computational constraints or task-specific demands. Additionally, integrating these quantization techniques within an automated machine learning pipeline could further generalize their applicability to various neural architectures beyond BERT-like transformers.

In conclusion, the findings and methodologies presented in this study contribute significantly to the field of efficient model inference, providing robust techniques that balance performance and deployability in modern AI systems.
