Extremely Small BERT Models from Mixed-Vocabulary Training

(1909.11687)
Published Sep 25, 2019 in cs.CL

Abstract

Pretrained language models like BERT have achieved good results on NLP tasks, but are impractical on resource-limited devices due to memory footprint. A large fraction of this footprint comes from the input embeddings with large input vocabulary and embedding dimensions. Existing knowledge distillation methods used for model compression cannot be directly applied to train student models with reduced vocabulary sizes. To this end, we propose a distillation method to align the teacher and student embeddings via mixed-vocabulary training. Our method compresses BERT-LARGE to a task-agnostic model with smaller vocabulary and hidden dimensions, which is an order of magnitude smaller than other distilled BERT models and offers a better size-accuracy trade-off on language understanding benchmarks as well as a practical dialogue task.

Overview

  • The paper introduces a mixed-vocabulary distillation method that substantially reduces the size of BERT models for deployment on resource-constrained devices while preserving competitive performance.

  • Its key innovation is reducing the vocabulary size during task-agnostic distillation of BERT models, realized through a two-stage distillation process.

  • The approach yields compact 6- and 12-layer BERT models that achieve competitive accuracy on benchmarks and a practical task while being substantially smaller than prior distilled models.

  • The research demonstrates the potential of mixed-vocabulary distillation for creating efficient BERT models, suggesting future exploration of combining this method with other compression techniques for optimized NLP model deployment.

Introduction

Pretrained language models such as BERT have significantly advanced the field of NLP. However, their deployment on resource-constrained devices remains a challenge due to their substantial memory requirements, driven primarily by large input embeddings. To address this issue, the paper introduces a distillation method that significantly reduces the size of BERT models by training with mixed vocabularies. This approach produces compact student BERT models that maintain competitive performance while being markedly smaller.

Related Work

The paper situates itself within the continuum of NLP model compression research, elaborating on four primary strategies: matrix approximation, weight quantization, pruning/sharing, and knowledge distillation. While each method contributes to the model compression paradigm, knowledge distillation—transferring knowledge from a large teacher model to a smaller student model—serves as the primary foundation for this research. The paper differentiates itself by focusing on reducing the vocabulary size, a less explored space in the context of task-agnostic model distillation, especially for BERT models.

Proposed Approach

The core of the proposed methodology is a two-stage distillation process built on mixed-vocabulary training. In the first stage, the teacher BERT model is trained on inputs whose words are tokenized with a mix of the teacher and student vocabularies, so that student-vocabulary embeddings are learned in, and aligned with, the teacher's embedding space; in the second stage, the student model is initialized with these embeddings and distilled from the teacher. The key contribution is aligning teacher and student embeddings without requiring the two models to share a vocabulary or output space, which enables a substantial reduction in vocabulary and model size.
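For intuition, a minimal sketch of the per-word vocabulary mixing is given below. The tokenizer objects, the mixing probability p_student, and the function name are assumptions chosen for illustration, not the authors' implementation.

```python
import random

def mixed_vocab_tokenize(words, teacher_tokenizer, student_tokenizer, p_student=0.5):
    """Segment each word with the student WordPiece vocabulary with probability
    p_student, otherwise with the teacher vocabulary, so that student wordpieces
    are trained alongside teacher wordpieces in the same model."""
    pieces = []
    for word in words:
        tokenizer = student_tokenizer if random.random() < p_student else teacher_tokenizer
        pieces.extend(tokenizer.tokenize(word))  # e.g. 5K vs. 30K WordPiece vocabularies
    return pieces
```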

Key Highlights:

  • Mixed-vocabulary training enables distilling BERT models with significantly reduced vocabulary sizes without sacrificing performance.
  • Reduced-vocabulary student models (6 and 12-layer) are trained, achieving size-efficiency while retaining a competitive accuracy on benchmark tasks against other distilled models.
  • The developed models offer a more favorable size-accuracy trade-off for language understanding benchmarks and practical dialogue tasks.

Experiments and Results

The distillation approach is evaluated across a range of tasks from the GLUE benchmark and a practical spoken language understanding task. The experiments cover student models with different numbers of layers and embedding/hidden dimensions, all using a reduced 5K WordPiece vocabulary.
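To make the embedding-footprint savings concrete, the back-of-the-envelope calculation below compares the wordpiece embedding table of a BERT-LARGE-sized teacher with that of a 5K-vocabulary student; the 256-dimensional student embedding size is an illustrative assumption, not one of the paper's exact configurations.

```python
def embedding_params(vocab_size, embedding_dim):
    """Parameters in the input (wordpiece) embedding table alone."""
    return vocab_size * embedding_dim

teacher = embedding_params(30_522, 1024)  # BERT-LARGE vocabulary and hidden size
student = embedding_params(5_000, 256)    # assumed reduced-vocabulary student

print(f"teacher embeddings: {teacher / 1e6:.1f}M parameters")  # ~31.3M
print(f"student embeddings: {student / 1e6:.1f}M parameters")  # ~1.3M
print(f"reduction: {teacher / student:.0f}x")                  # ~24x
```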

Performance:

  • The 6- and 12-layer mixed-vocabulary distilled models perform strongly while being up to an order of magnitude smaller than other distilled BERT models.
  • The models outperform or competitively match the baselines, including state-of-the-art distillation approaches, demonstrating the effectiveness of the mixed-vocabulary distillation method.

Discussion

The paper addresses several facets of distillation, such as the impact of vocabulary size on model performance, alternative vocabulary pruning strategies, and the robustness of mixed-vocabulary training over traditional distillation methods. Notably, it observes that smaller WordPiece vocabularies are nearly as effective for sequence classification and tagging tasks, especially with smaller BERT model dimensions.
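As a hedged illustration of how such a reduced WordPiece vocabulary might be built, the snippet below trains a 5K vocabulary with the Hugging Face tokenizers library; the corpus path and training parameters are assumptions, and the paper does not prescribe this particular tooling.

```python
from tokenizers import BertWordPieceTokenizer

# Train a small 5K WordPiece vocabulary on a plain-text corpus
# (corpus.txt is a placeholder path, not from the paper).
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["corpus.txt"], vocab_size=5000, min_frequency=2)
tokenizer.save_model(".")  # writes vocab.txt for the reduced vocabulary
```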

Conclusion

The presented research introduces a strategic advancement in NLP model compression through a mixed-vocabulary distillation approach. This work not only contributes to the ongoing dialogue on making large-scale NLP models more accessible for deployment on resource-constrained platforms but also paves the way for further optimizations in model size without considerable performance trade-offs. Future work might explore combining this approach with other compression techniques to achieve even smaller and more efficient BERT models, providing a promising direction for on-device deployment of state-of-the-art NLP models.
