TernaryBERT: Distillation-aware Ultra-low Bit BERT

Published 27 Sep 2020 in cs.CL, cs.LG, cs.SD, and eess.AS | (2009.12812v3)

Abstract: Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to resource-constrained devices. In this work, we propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model. Specifically, we use both approximation-based and loss-aware ternarization methods and empirically investigate the ternarization granularity of different parts of BERT. Moreover, to reduce the accuracy degradation caused by the lower capacity of low bits, we leverage the knowledge distillation technique in the training process. Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods, and even achieves comparable performance as the full-precision model while being 14.9x smaller.

Authors (7)

Citations (196)

Summary

The paper introduces TernaryBERT, which compresses BERT using 2-bit ternarization combined with knowledge distillation to maintain performance.
It employs both approximation-based and loss-aware techniques to quantize weights to {-1, 0, +1} across embeddings and Transformer layers.
Empirical results on GLUE and SQuAD show that TernaryBERT achieves accuracy comparable to full-precision models while being nearly 15 times smaller.

The paper introduces TernaryBERT, a method for compressing BERT, a Transformer-based pre-trained language model, so that it can be deployed on resource-constrained devices such as mobile phones. The core challenge it addresses is the computational and memory cost of deploying large pre-trained models such as BERT, which have hundreds of millions of parameters.

Methodology

TernaryBERT ternarizes the weights of a fine-tuned BERT model, storing each weight in an ultra-low 2-bit representation without altering the architecture. Both approximation-based and loss-aware ternarization techniques are used, and each weight takes a value from {-1, 0, +1}. Because ultra-low-bit quantization reduces model capacity, the paper uses knowledge distillation to limit the accuracy loss: a high-performing full-precision "teacher" model guides the lower-capacity ternarized "student" model during training, so that the student recovers most of the accuracy that quantization would otherwise cost.
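The two ingredients above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the threshold rule (0.7 times the mean absolute weight) is the heuristic commonly associated with approximation-based ternarization in Ternary Weight Networks, the scalar scaling factor `alpha` multiplies the ternary codes, and the distillation objective is assumed to be a soft cross-entropy on logits plus a mean squared error on hidden states with equal weighting. The loss-aware variant is not shown, and the function names are hypothetical.

```python
import numpy as np

def ternarize_twn(w):
    """Approximation-based ternarization (TWN-style heuristic, assumed here).

    Returns (alpha, b) with b in {-1, 0, +1} such that alpha * b approximates w.
    """
    w = np.asarray(w, dtype=np.float64)
    delta = 0.7 * np.abs(w).mean()                     # weights with |w| <= delta map to 0
    b = np.where(np.abs(w) > delta, np.sign(w), 0.0)   # ternary codes
    nonzero = b != 0
    # Scale that best fits the surviving weights in the least-squares sense.
    alpha = np.abs(w[nonzero]).mean() if nonzero.any() else 0.0
    return alpha, b

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden):
    """Illustrative distillation objective (assumed form and 1:1 weighting):
    soft cross-entropy between teacher and student predictions, plus MSE
    between their hidden states."""
    p_teacher = np.exp(log_softmax(teacher_logits))
    soft_ce = -(p_teacher * log_softmax(student_logits)).sum(axis=-1).mean()
    hidden_mse = np.mean((student_hidden - teacher_hidden) ** 2)
    return soft_ce + hidden_mse

# Toy usage with random data (not real BERT weights).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
alpha, b = ternarize_twn(w)
print("scale:", alpha)
print("distinct codes:", np.unique(b))
print("reconstruction MSE:", np.mean((w - alpha * b) ** 2))
```

In a training loop, the ternarized weights `alpha * b` would be used in the student's forward pass while the loss above compares the student against the full-precision teacher; how gradients are propagated through the quantizer is outside the scope of this sketch.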

Key Results

Experiments on the GLUE benchmark and SQuAD indicate that TernaryBERT outperforms existing BERT quantization methods and achieves performance comparable to the full-precision model while being 14.9 times smaller. This is a notable improvement over previous 2-bit models, whose accuracy dropped significantly on natural language processing tasks.
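To put the 14.9x figure in context, a back-of-the-envelope calculation helps. The sketch below assumes a simplified two-precision model in which every parameter is stored either at 2 bits or at 32 bits; that split is an assumption made for illustration, not a breakdown reported in this summary, and the function names are hypothetical. Under that assumption, 2-bit storage caps the compression at 16x, and a reported ratio of 14.9x would correspond to only a small fraction of parameters kept at full precision.

```python
def compression_ratio(frac_full_precision, low_bits=2, full_bits=32):
    """Size reduction versus an all-full-precision model, assuming each parameter
    is stored at either low_bits or full_bits (a simplifying assumption)."""
    avg_bits = low_bits * (1 - frac_full_precision) + full_bits * frac_full_precision
    return full_bits / avg_bits

def implied_full_precision_fraction(ratio, low_bits=2, full_bits=32):
    """Invert compression_ratio: solve full_bits / avg_bits = ratio for the fraction."""
    return (full_bits / ratio - low_bits) / (full_bits - low_bits)

print("upper bound with everything at 2 bits:", compression_ratio(0.0))
frac = implied_full_precision_fraction(14.9)
print("implied full-precision fraction for 14.9x:", frac)
```

Real storage formats also carry scaling factors and may use other bit widths for some components, so the implied fraction should be read only as an order-of-magnitude indication.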

Technical Insights

The paper uses two granularities of ternarization: row-wise for the word embeddings and layer-wise for the weights in the Transformer layers. Row-wise ternarization gives each word's embedding vector its own scaling factor, which the paper finds preserves the embeddings' semantic information better than a single shared scale. Activations are quantized to 8 bits. Because their distribution is skewed toward negative values, min-max quantization, which covers the actual asymmetric range of the data, offers finer resolution than symmetric quantization and is therefore preferred.
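Both design choices can be illustrated on synthetic data. The sketch below is an illustration under assumptions rather than the paper's code: it reuses a TWN-style threshold (0.7 times the mean absolute value) for ternarization, the "embedding" matrix and "activations" are random toy data chosen to have rows of very different magnitude and a negatively skewed range, and all function names are hypothetical.

```python
import numpy as np

def ternarize(w, axis=None):
    """TWN-style ternarization (assumed heuristic).

    axis=None uses one threshold and scale for the whole matrix (layer-wise);
    axis=1 uses one threshold and scale per row (row-wise).
    Returns the dequantized approximation alpha * b."""
    a = np.abs(w)
    delta = 0.7 * a.mean(axis=axis, keepdims=True)
    mask = a > delta
    count = np.maximum(mask.sum(axis=axis, keepdims=True), 1)  # avoid division by zero
    alpha = (a * mask).sum(axis=axis, keepdims=True) / count
    return alpha * np.sign(w) * mask

def quantize_symmetric(x, bits=8):
    """Symmetric quantization: integer grid centered on zero, range [-max|x|, max|x|]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quantize_minmax(x, bits=8):
    """Min-max quantization: grid spans exactly [min(x), max(x)]."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)

# Toy "embedding" matrix whose rows differ widely in magnitude.
emb = rng.normal(size=(3, 256)) * np.array([[0.01], [1.0], [10.0]])
err_layer = np.mean((emb - ternarize(emb, axis=None)) ** 2)
err_row = np.mean((emb - ternarize(emb, axis=1)) ** 2)
print("layer-wise MSE:", err_layer, "row-wise MSE:", err_row)

# Toy activations skewed toward negative values.
act = rng.uniform(-4.0, 1.0, size=10_000)
err_sym = np.mean((act - quantize_symmetric(act)) ** 2)
err_mm = np.mean((act - quantize_minmax(act)) ** 2)
print("symmetric MSE:", err_sym, "min-max MSE:", err_mm)
```

On data like this, the per-row scale adapts to each row's magnitude, and the min-max grid does not spend quantization levels on the unused positive range, which is the intuition behind the choices described above.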

Implications and Future Work

The results matter for deploying BERT-like models on edge devices that are constrained in memory and compute. By showing that ternarization paired with distillation is effective, the work points toward cheaper and more energy-efficient AI applications. Future work could apply these quantization methods to a broader range of models and further improve the balance between accuracy and efficiency. Extending the techniques to architectures beyond Transformers may also prove fruitful.

In conclusion, TernaryBERT is a significant step toward highly efficient model deployment: it preserves strong natural language understanding performance despite a drastic reduction in computational and memory footprint.
