
Extremely Small BERT Models from Mixed-Vocabulary Training (1909.11687v2)

Published 25 Sep 2019 in cs.CL

Abstract: Pretrained language models like BERT have achieved good results on NLP tasks, but are impractical on resource-limited devices due to memory footprint. A large fraction of this footprint comes from the input embeddings with large input vocabulary and embedding dimensions. Existing knowledge distillation methods used for model compression cannot be directly applied to train student models with reduced vocabulary sizes. To this end, we propose a distillation method to align the teacher and student embeddings via mixed-vocabulary training. Our method compresses BERT-LARGE to a task-agnostic model with smaller vocabulary and hidden dimensions, which is an order of magnitude smaller than other distilled BERT models and offers a better size-accuracy trade-off on language understanding benchmarks as well as a practical dialogue task.

Citations (53)

Summary

  • The paper presents a two-stage mixed-vocabulary distillation method that reduces BERT model size while maintaining competitive benchmark performance.
  • It aligns teacher and student embeddings without requiring matching output spaces, achieving significant compression by reducing vocabulary size.
  • Experiments on GLUE benchmarks and spoken language tasks demonstrate that the distilled models offer a favorable size-accuracy trade-off.

Extremely Small BERT Models from Mixed-Vocabulary Training

Introduction

Pretrained language models such as BERT have significantly advanced the field of NLP. However, their deployment on resource-constrained devices remains a challenge due to their substantial memory requirements, a large fraction of which comes from the input embeddings with their large vocabularies and embedding dimensions. Addressing this issue, the paper introduces a distillation method that significantly reduces the size of BERT models by training with mixed vocabularies. The approach produces compact student BERT models that maintain competitive performance while being an order of magnitude smaller than other distilled BERT models.
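To see why the input embeddings dominate the memory footprint, the back-of-the-envelope arithmetic below compares the embedding table of BERT-Large (roughly a 30K WordPiece vocabulary with 1024-dimensional embeddings) against a hypothetical student that uses the paper's 5K vocabulary with a 256-dimensional embedding. The student dimension here is an illustrative assumption, not a configuration reported in this summary.

```python
# Back-of-the-envelope sizes for input embedding tables (parameters = vocab_size * dim).
# The 5K vocabulary is from the paper; the 256-dimensional student embedding is an
# illustrative assumption.

def embedding_params(vocab_size: int, embedding_dim: int) -> int:
    """Number of parameters in a token-embedding lookup table."""
    return vocab_size * embedding_dim

teacher = embedding_params(vocab_size=30_522, embedding_dim=1024)  # BERT-Large
student = embedding_params(vocab_size=5_000, embedding_dim=256)    # hypothetical student

print(f"teacher embeddings: {teacher / 1e6:.1f}M parameters")  # ~31.3M
print(f"student embeddings: {student / 1e6:.1f}M parameters")  # ~1.3M
print(f"reduction factor:   {teacher / student:.1f}x")         # ~24.4x
```

Under these assumptions the embedding table alone shrinks by more than an order of magnitude, which is the lever the paper's vocabulary reduction targets.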

The paper situates itself within NLP model compression research, which it organizes into four primary strategies: matrix approximation, weight quantization, pruning/sharing, and knowledge distillation. Among these, knowledge distillation, in which knowledge is transferred from a large teacher model to a smaller student model, serves as the foundation for this work. The paper differentiates itself by focusing on reducing the vocabulary size, a comparatively unexplored direction in task-agnostic model distillation, especially for BERT models.

Proposed Approach

The core of the proposed methodology is a two-stage distillation process built on mixed-vocabulary training. In the first stage, the teacher BERT model is trained on inputs whose words are tokenized with a mix of the teacher and student vocabularies, so that embeddings for the student's tokens are learned within the teacher's embedding space. In the second stage, the student model is distilled starting from these learned embeddings. The significant contribution of this approach is that it aligns teacher and student embeddings without requiring the two models to have compatible output spaces, thereby enabling a substantial reduction in vocabulary and model size.
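The first stage can be pictured as a tokenization-level mixing step. The sketch below is a minimal illustration of that idea; the per-word mixing granularity, the probability `p_student`, and the tokenizer objects are assumptions for illustration rather than details confirmed by this summary.

```python
import random

# Minimal sketch of stage-1 mixed-vocabulary tokenization. `teacher_tokenizer` and
# `student_tokenizer` stand in for WordPiece tokenizers over the large (teacher) and
# small (student) vocabularies; the mixing probability is an assumed value.

def mixed_vocab_tokenize(words, teacher_tokenizer, student_tokenizer, p_student=0.5):
    """Tokenize each word with either the teacher's or the student's vocabulary."""
    tokens = []
    for word in words:
        if random.random() < p_student:
            # Tag student WordPieces so the teacher can learn dedicated embedding
            # rows for them alongside its own vocabulary.
            tokens.extend(f"[STU]{piece}" for piece in student_tokenizer(word))
        else:
            tokens.extend(teacher_tokenizer(word))
    return tokens
```

Continuing the teacher's training on such mixed sequences is what, per the summary above, pulls the student-vocabulary embeddings into the same space as the teacher's original embeddings.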

Key Highlights:

  • Mixed-vocabulary training enables distilling BERT models with significantly reduced vocabulary sizes without sacrificing performance.
  • Reduced-vocabulary student models (6- and 12-layer) are trained, achieving substantial size savings while retaining competitive accuracy on benchmark tasks against other distilled models (a sketch of the second stage follows this list).
  • The developed models offer a more favorable size-accuracy trade-off for language understanding benchmarks and practical dialogue tasks.
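For a concrete picture of the second stage, the sketch below initializes the student's embedding table from the embeddings learned in stage one and then refines the student with a soft-label distillation loss. Both the optional projection for mismatched embedding dimensions and the KL-based loss are generic, assumed choices; the summary does not specify the paper's exact stage-two objective, so this should be read as a sketch rather than the authors' method.

```python
import torch
import torch.nn.functional as F

# Sketch of stage 2: copy the student-vocabulary embeddings learned during
# mixed-vocabulary training into the student, then refine the student against the
# teacher's soft predictions. The projection and the KL-based loss are assumptions.

def init_student_embeddings(student_embedding, learned_embeddings, proj=None):
    """Initialize the student's embedding table from stage-1 embeddings.

    learned_embeddings: [student_vocab, teacher_dim] tensor from stage 1.
    proj: optional [teacher_dim, student_dim] projection when dimensions differ.
    """
    emb = learned_embeddings if proj is None else learned_embeddings @ proj
    with torch.no_grad():
        student_embedding.weight.copy_(emb)

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic distillation loss: KL divergence between softened distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```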

Experiments and Results

The distillation efficacy is evaluated across a spectrum of tasks from the GLUE benchmark and a practical spoken language understanding task. The experiments cover models with varying numbers of layers and embedding/hidden dimensions, all using a reduced 5K WordPiece vocabulary.

Performance:

  • The 6- and 12-layer mixed-vocabulary distilled models perform competitively while being up to an order of magnitude smaller than other distilled BERT models.
  • The models outperform or closely match the baselines, including state-of-the-art distillation approaches, demonstrating the effectiveness of the mixed-vocabulary distillation method.

Discussion

The paper addresses several facets of distillation, such as the impact of vocabulary size on model performance, alternative vocabulary pruning strategies, and the robustness of mixed-vocabulary training over traditional distillation methods. Notably, it observes that smaller WordPiece vocabularies are nearly as effective for sequence classification and tagging tasks, especially with smaller BERT model dimensions.

Conclusion

The presented research introduces a strategic advancement in NLP model compression through a mixed-vocabulary distillation approach. This work not only contributes to the ongoing dialogue on making large-scale NLP models more accessible for deployment on resource-constrained platforms but also paves the way for further optimizations in model size without considerable performance trade-offs. Future work might explore combining this approach with other compression techniques to achieve even smaller and more efficient BERT models, providing a promising direction for on-device deployment of state-of-the-art NLP models.
