Q8BERT: Quantized 8Bit BERT

Published 14 Oct 2019 in cs.CL and cs.LG | (1910.06188v2)

Abstract: Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many NLP tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward large pre-trained Transformer models. However, using these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer arithmetic.

Authors (4)

Citations (480)

Summary

The paper presents a quantization-aware training technique that compresses BERT to 8-bit precision with minimal accuracy loss.
The method uses symmetric linear quantization and simulates quantized inference during fine-tuning, so the model learns to tolerate quantization error.
Empirical results show a fourfold reduction in model size, with accuracy staying within 1% of the floating-point baseline on most tasks.

Q8BERT: Quantized 8Bit BERT

Deploying large pre-trained Transformer-based language models such as BERT in production environments requires reducing their computational and memory demands. The paper "Q8BERT: Quantized 8Bit BERT" addresses this challenge by presenting a method for compressing BERT to enable efficient inference with minimal accuracy loss. The authors apply quantization-aware training during the fine-tuning phase and achieve a compression factor of four by representing values as 8-bit integers instead of 32-bit floating-point numbers.

Methodology

The authors use symmetric linear quantization to map BERT's weights and activations to 8-bit integers. The scheme aims to shrink the memory footprint and to speed up inference on hardware that supports integer arithmetic. Because quantized inference is simulated during fine-tuning, the model can adapt its weights to the resulting quantization error.
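As an illustration of the idea, here is a minimal NumPy sketch of symmetric linear quantization and of the "simulated" (quantize, then dequantize) step used during training. The function names, the max-absolute-value scale, and the random test tensor are assumptions made for this example, not the authors' code. How gradients pass through the rounding step (commonly a straight-through estimator) is not shown.

```python
import numpy as np

MAX_LEVEL = 127  # largest magnitude used for signed 8-bit values, keeping the range symmetric

def symmetric_scale(x):
    """Scale that maps the largest |x| onto MAX_LEVEL (assumed max-abs calibration)."""
    max_abs = float(np.max(np.abs(x)))
    # Assumption: fall back to a scale of 1 for an all-zero tensor.
    return MAX_LEVEL / max_abs if max_abs > 0 else 1.0

def quantize(x, scale):
    """Round to the nearest integer level and clamp to [-MAX_LEVEL, MAX_LEVEL]."""
    return np.clip(np.round(x * scale), -MAX_LEVEL, MAX_LEVEL).astype(np.int8)

def dequantize(q, scale):
    """Map integer levels back to approximate floating-point values."""
    return q.astype(np.float32) / scale

def fake_quantize(x):
    """Simulate quantized inference: the output is float but carries rounding error."""
    scale = symmetric_scale(x)
    return dequantize(quantize(x, scale), scale)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(4, 4)).astype(np.float32)  # hypothetical weights
scale = symmetric_scale(weights)
simulated = fake_quantize(weights)
max_error = float(np.max(np.abs(weights - simulated)))
print(f"scale={scale:.2f}, max rounding error={max_error:.6f}")
```

Because zero maps exactly to integer 0 and the range is symmetric, no zero-point offset is needed. The rounding error of any unclamped value is at most half a quantization step, that is, `0.5 / scale`.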

Results and Evaluation

The paper reports empirical results from applying its quantization method to several NLP tasks from the GLUE benchmark and to SQuADv1.1. The models trained with quantization-aware training (QAT) achieved accuracy within 1% of the floating-point baselines on most tasks and significantly outperformed dynamically quantized (DQ) models, which suffered greater accuracy degradation. The fourfold reduction in model size therefore came with only a small loss in task performance.
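The factor of four follows directly from the bit widths: a 32-bit value takes four times the storage of an 8-bit one. The short sketch below works through that arithmetic. The parameter count is an assumed round figure of roughly BERT-base scale, and the calculation ignores the stored scale factors and any parts of a model that might remain in floating point.

```python
def tensor_megabytes(num_params, bits_per_param):
    """Storage for num_params values at the given bit width, in megabytes (10^6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

NUM_PARAMS = 110_000_000  # assumption: round figure, roughly BERT-base scale

fp32_mb = tensor_megabytes(NUM_PARAMS, 32)
int8_mb = tensor_megabytes(NUM_PARAMS, 8)
ratio = fp32_mb / int8_mb

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, ratio: {ratio:.1f}x")
```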

Implications and Future Directions

The quantization approach detailed in the paper has significant implications for deploying NLP models in resource-constrained environments. By reducing model size and allowing faster inference with 8-bit integer arithmetic on hardware that supports it, the method helps NLP applications with low latency requirements run on a wider range of computational platforms.
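To make the role of integer arithmetic concrete, the following sketch multiplies two matrices using 8-bit operands and 32-bit integer accumulation, then rescales the result back to floating point. It is a NumPy illustration under stated assumptions (per-tensor max-abs scales, hypothetical function names, random inputs with a nonzero maximum) and does not show the speedup itself, which depends on hardware integer instructions.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to int8; returns the levels and the scale."""
    scale = 127.0 / float(np.max(np.abs(x)))  # assumes x is not all zeros
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    """Approximate x @ w using integer products accumulated in int32."""
    qx, sx = quantize_int8(x)
    qw, sw = quantize_int8(w)
    # Widen before multiplying so the sums of products cannot overflow int8.
    acc = qx.astype(np.int32) @ qw.astype(np.int32)
    # One floating-point rescale per output element undoes both scales.
    return acc.astype(np.float32) / (sx * sw)

rng = np.random.default_rng(0)
activations = rng.normal(size=(2, 8)).astype(np.float32)  # hypothetical inputs
weights = rng.normal(size=(8, 3)).astype(np.float32)      # hypothetical weights

reference = activations @ weights
approx = int8_matmul(activations, weights)
max_abs_diff = float(np.max(np.abs(reference - approx)))
print(f"max abs difference vs float matmul: {max_abs_diff:.4f}")
```

All multiply-accumulate work happens on integers; floating point is needed only to compute the scales and to rescale the output.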

Future directions suggested by the authors include further exploration of model compression techniques that could complement quantization, potentially offering additional reductions in memory usage and power consumption. Such advancements are crucial for deploying models like BERT in scenarios with strict constraints on computational resources.

When such techniques are integrated into NLP systems, the resulting efficiency gains may also benefit the broader field of AI by making the deployment of advanced language models more sustainable and accessible across diverse applications.
