Token Dropping for Efficient BERT Pretraining

(2203.13240)
Published Mar 24, 2022 in cs.CL and cs.LG

Abstract

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model to make the model focus on important tokens; the dropped tokens are later picked up by the last layer of the model so that the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.

Overview

  • The paper introduces a new approach for efficient BERT pretraining called 'Token Dropping', aiming to reduce computational cost without sacrificing model performance by dynamically dropping 'unimportant' tokens.

  • Token importance is estimated with the masked language modeling (MLM) loss; less important tokens are dropped in intermediate layers, while essential tokens are always retained.

  • Experimental results show a 25% reduction in pretraining costs with comparable or slightly improved performance on downstream tasks like GLUE and SQuAD.

  • The approach suggests a shift from conventional all-token processing, with prospects for extending the methodology to other transformer models and exploring its effectiveness across languages.

Efficient BERT Pretraining via Token Dropping

Introduction

In the pursuit of efficient pretraining of transformer models like BERT, the paper "Token Dropping for Efficient BERT Pretraining" proposes an approach to reduce computational cost without compromising performance on downstream tasks. The method dynamically identifies and drops 'unimportant' tokens in the intermediate layers of the model during pretraining, yielding a significant reduction in pretraining time.

Methodology

The core strategy involves a "token dropping" mechanism across certain layers of the model during pretraining. The main contributions can be distilled into two key aspects:

  • Token Importance Estimation: A scheme to determine the importance of tokens using the masked language modeling (MLM) loss. Tokens with historically low MLM losses are deemed less important and become candidates for dropping in specific layers (a sketch of this bookkeeping follows this list).
  • Layer-Specific Token Processing: The architecture retains full token processing in initial and final layers but selectively drops tokens in intermediate layers. This arrangement ensures the model's exposure to all tokens, while still concentrating computational efforts on tokens identified as 'important'.
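
To make the importance bookkeeping concrete, here is a minimal PyTorch sketch, assuming an exponentially smoothed running-average MLM loss per vocabulary id; the vocabulary size, special-token ids, and smoothing factor below are illustrative assumptions, not values taken from the paper.

```python
import torch

# Minimal sketch (not the authors' released code): keep a running-average MLM
# loss per vocabulary id as the importance signal, and pin special tokens to
# +inf so they are never dropped. VOCAB_SIZE, SPECIAL_IDS, and DECAY are
# assumptions for illustration.

VOCAB_SIZE = 30522                  # BERT WordPiece vocabulary
SPECIAL_IDS = [101, 102, 103]       # [CLS], [SEP], [MASK] in the BERT vocab
DECAY = 0.99                        # smoothing factor for the running average

importance = torch.zeros(VOCAB_SIZE)        # running MLM loss per vocab id
importance[SPECIAL_IDS] = float("inf")      # special tokens stay important


def update_importance(masked_ids: torch.Tensor, per_token_loss: torch.Tensor) -> None:
    """Update running scores from one batch of masked positions.

    masked_ids:     1-D tensor with the original vocab ids at masked positions
    per_token_loss: 1-D tensor with the MLM cross-entropy at those positions
    """
    for tok, loss in zip(masked_ids.tolist(), per_token_loss.tolist()):
        if tok in SPECIAL_IDS:
            continue
        importance[tok] = DECAY * importance[tok] + (1.0 - DECAY) * loss
```

Tokens whose running loss stays low are the ones the model already predicts easily, which is exactly what makes them candidates for dropping.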

Token Dropping Mechanism

The process commences by retaining all tokens in the first few layers (denoted $L_f$) and in the very last layer, ensuring initial contextual understanding and full-length final output. In the intermediate layers ($L_h$), tokens with lower importance scores are dropped. Token importance is updated dynamically throughout training based on the MLM loss, with a mechanism in place to ensure essential tokens such as `[CLS]`, `[MASK]`, and `[SEP]` are always treated as important.
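
The drop-and-merge pass can be sketched roughly as follows, again in PyTorch; the function signature, the `keep_ratio`, and the use of top-k selection over per-position scores are assumptions made for illustration rather than the paper's exact interface.

```python
import torch

def token_dropping_forward(hidden, scores, full_layers, half_layers, last_layer,
                           keep_ratio=0.5):
    """Sketch of the drop-and-merge pass (names and keep_ratio are assumptions).

    hidden:      [batch, seq_len, dim] states after the embedding layer
    scores:      [batch, seq_len] per-position importance (e.g. running MLM loss)
    full_layers: the first L_f transformer layers, applied to the full sequence
    half_layers: the middle L_h layers, applied only to the kept tokens
    last_layer:  the final layer, applied to the re-assembled full sequence
    """
    batch, seq_len, dim = hidden.shape
    k = max(1, int(seq_len * keep_ratio))

    # 1. The first L_f layers see every token.
    for layer in full_layers:
        hidden = layer(hidden)

    # 2. Keep only the k most important positions through the middle layers.
    keep_idx = scores.topk(k, dim=1).indices                   # [batch, k]
    gather_idx = keep_idx.unsqueeze(-1).expand(batch, k, dim)  # [batch, k, dim]
    kept = hidden.gather(1, gather_idx)
    for layer in half_layers:
        kept = layer(kept)

    # 3. Merge: dropped positions keep their states from step 1, so the
    #    final layer still sees a full-length sequence.
    hidden = hidden.scatter(1, gather_idx, kept)

    # 4. The last layer processes all tokens again.
    return last_layer(hidden)


# Toy usage with identity layers, just to check shapes:
B, S, D = 2, 8, 16
out = token_dropping_forward(torch.randn(B, S, D), torch.rand(B, S),
                             [torch.nn.Identity()], [torch.nn.Identity()],
                             torch.nn.Identity())
assert out.shape == (B, S, D)
```

Because the middle layers run on only a fraction of the positions while the first and last layers remain full-width, the overall compute per pretraining step drops without changing the model's output length.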

Experimental Evaluation

Setup

Experiments were conducted using BERT on the standard BooksCorpus and English Wikipedia datasets. The study thoroughly investigates the effects of token dropping across various setups including BERT-base and BERT-large models, with and without additional stage-2 pretraining (where no token dropping occurs).

Findings

  • Pretraining Efficiency: Token dropping led to a 25% reduction in pretraining costs while maintaining or slightly improving performance across a range of downstream tasks including the GLUE and SQuAD benchmarks.
  • Importance Estimation Validation: The dynamic approach of estimating token importance using cumulative MLM loss outperformed static methods, such as frequency-based dropping. Interestingly, adding randomization in token importance did not yield improvements, underlining the efficacy of the MLM loss-based importance estimation.
  • Layer-Specific Dropping Insights: The strategy of dropping tokens exclusively in intermediate layers was validated, with the arrangement of retaining full token processing in the initial layers and the last layer proving optimal.

Implications and Future Work

The proposed token dropping strategy opens up new avenues for efficiently pretraining language models. By demonstrating that selectively focusing on 'important' tokens in certain layers during pretraining can yield significant computational savings without sacrificing model performance, the approach challenges conventional all-token processing.

Future explorations might focus on extending this token dropping paradigm to other transformer architectures, including those designed for longer contexts such as the Longformer. Given the methodology's initial validation on English language tasks, assessing its applicability and efficiency across diverse languages presents an intriguing area for further research.

Conclusion

"Token Dropping for Efficient BERT Pretraining" introduces a pragmatic and effective strategy for reducing the computational overhead of BERT pretraining. By judiciously focusing computational resources on tokens identified as 'important', the approach achieves a compelling balance between efficiency and performance. This method not only marks a significant step forward in pretraining efficiency but also lays the groundwork for future advancements in building more sustainable and cost-effective natural language processing models.
