Token Dropping for Efficient BERT Pretraining

(2203.13240)
Published Mar 24, 2022 in cs.CL and cs.LG

Abstract

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model to make the model focus on important tokens; the dropped tokens are later picked up by the last layer of the model so that the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.

Overview

  • The paper introduces a new approach for efficient BERT pretraining called 'Token Dropping', aiming to reduce computational cost without sacrificing model performance by dynamically dropping 'unimportant' tokens.

  • Token importance is estimated with the masked language modeling (MLM) loss; less important tokens are dropped in intermediate layers, while essential tokens are always retained.

  • Experimental results show a 25% reduction in pretraining costs with comparable or slightly improved performance on downstream tasks like GLUE and SQuAD.

  • The approach suggests a shift from conventional all-token processing, with prospects for extending the methodology to other transformer models and exploring its effectiveness across languages.

Efficient BERT Pretraining via Token Dropping

Introduction

In the pursuit of efficient pretraining of transformer models like BERT, the paper "Token Dropping for Efficient BERT Pretraining" proposes an approach to reduce computational cost without compromising performance on downstream tasks. The method dynamically identifies and drops 'unimportant' tokens in the intermediate layers of the model during pretraining, yielding a significant reduction in pretraining time.

Methodology

The core strategy involves a "token dropping" mechanism across certain layers of the model during pretraining. The main contributions can be distilled into two key aspects:

  • Token Importance Estimation: A scheme to determine the importance of tokens using the masked language modeling (MLM) loss. Tokens with historically low MLM losses are deemed less important and become candidates for dropping in specific layers (a sketch of this bookkeeping follows this list).
  • Layer-Specific Token Processing: The architecture retains full token processing in initial and final layers but selectively drops tokens in intermediate layers. This arrangement ensures the model's exposure to all tokens, while still concentrating computational efforts on tokens identified as 'important'.
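
To make the importance bookkeeping concrete, here is a minimal PyTorch sketch, assuming an exponentially smoothed running-average MLM loss per vocabulary id; the vocabulary size, special-token ids, and smoothing factor below are illustrative assumptions, not values taken from the paper.

```python
import torch

# Minimal sketch (not the authors' released code): keep a running-average MLM
# loss per vocabulary id as the importance signal, and pin special tokens to
# +inf so they are never dropped. VOCAB_SIZE, SPECIAL_IDS, and DECAY are
# assumptions for illustration.

VOCAB_SIZE = 30522                  # BERT WordPiece vocabulary
SPECIAL_IDS = [101, 102, 103]       # [CLS], [SEP], [MASK] in the BERT vocab
DECAY = 0.99                        # smoothing factor for the running average

importance = torch.zeros(VOCAB_SIZE)        # running MLM loss per vocab id
importance[SPECIAL_IDS] = float("inf")      # special tokens stay important


def update_importance(masked_ids: torch.Tensor, per_token_loss: torch.Tensor) -> None:
    """Update running scores from one batch of masked positions.

    masked_ids:     1-D tensor with the original vocab ids at masked positions
    per_token_loss: 1-D tensor with the MLM cross-entropy at those positions
    """
    for tok, loss in zip(masked_ids.tolist(), per_token_loss.tolist()):
        if tok in SPECIAL_IDS:
            continue
        importance[tok] = DECAY * importance[tok] + (1.0 - DECAY) * loss
```

Tokens whose running loss stays low are the ones the model already predicts easily, which is exactly what makes them candidates for dropping.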

Token Dropping Mechanism

The process commences by retaining all tokens in the first few layers (denoted $L_f$) and in the very last layer, ensuring initial contextual understanding and full-length final output. In the intermediate layers ($L_h$), tokens with lower importance scores are dropped. Token importance is updated dynamically throughout training based on the MLM loss, with a mechanism in place to ensure essential tokens such as `[CLS]`, `[MASK]`, and `[SEP]` are always treated as important.
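
The drop-and-merge pass can be sketched roughly as follows, again in PyTorch; the function signature, the `keep_ratio`, and the use of top-k selection over per-position scores are assumptions made for illustration rather than the paper's exact interface.

```python
import torch

def token_dropping_forward(hidden, scores, full_layers, half_layers, last_layer,
                           keep_ratio=0.5):
    """Sketch of the drop-and-merge pass (names and keep_ratio are assumptions).

    hidden:      [batch, seq_len, dim] states after the embedding layer
    scores:      [batch, seq_len] per-position importance (e.g. running MLM loss)
    full_layers: the first L_f transformer layers, applied to the full sequence
    half_layers: the middle L_h layers, applied only to the kept tokens
    last_layer:  the final layer, applied to the re-assembled full sequence
    """
    batch, seq_len, dim = hidden.shape
    k = max(1, int(seq_len * keep_ratio))

    # 1. The first L_f layers see every token.
    for layer in full_layers:
        hidden = layer(hidden)

    # 2. Keep only the k most important positions through the middle layers.
    keep_idx = scores.topk(k, dim=1).indices                   # [batch, k]
    gather_idx = keep_idx.unsqueeze(-1).expand(batch, k, dim)  # [batch, k, dim]
    kept = hidden.gather(1, gather_idx)
    for layer in half_layers:
        kept = layer(kept)

    # 3. Merge: dropped positions keep their states from step 1, so the
    #    final layer still sees a full-length sequence.
    hidden = hidden.scatter(1, gather_idx, kept)

    # 4. The last layer processes all tokens again.
    return last_layer(hidden)


# Toy usage with identity layers, just to check shapes:
B, S, D = 2, 8, 16
out = token_dropping_forward(torch.randn(B, S, D), torch.rand(B, S),
                             [torch.nn.Identity()], [torch.nn.Identity()],
                             torch.nn.Identity())
assert out.shape == (B, S, D)
```

Because the middle layers run on only a fraction of the positions while the first and last layers remain full-width, the overall compute per pretraining step drops without changing the model's output length.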

Experimental Evaluation

Setup

Experiments were conducted using BERT on the standard BooksCorpus and English Wikipedia datasets. The study thoroughly investigates the effects of token dropping across various setups including BERT-base and BERT-large models, with and without additional stage-2 pretraining (where no token dropping occurs).

Findings

  • Pretraining Efficiency: Token dropping led to a 25% reduction in pretraining costs while maintaining or slightly improving performance across a range of downstream tasks including the GLUE and SQuAD benchmarks.
  • Importance Estimation Validation: The dynamic approach of estimating token importance using cumulative MLM loss outperformed static methods, such as frequency-based dropping. Interestingly, adding randomization in token importance did not yield improvements, underlining the efficacy of the MLM loss-based importance estimation.
  • Layer-Specific Dropping Insights: The strategy of dropping tokens exclusively in intermediate layers was validated, with the arrangement of retaining full token processing in the initial layers and the last layer proving optimal.

Implications and Future Work

The proposed token dropping strategy opens up new avenues for efficiently pretraining language models. By demonstrating that selectively focusing on 'important' tokens in certain layers during pretraining can yield significant computational savings without sacrificing model performance, the approach challenges conventional all-token processing.

Future explorations might focus on extending this token dropping paradigm to other transformer architectures, including those designed for longer contexts such as the Longformer. Given the methodology's initial validation on English language tasks, assessing its applicability and efficiency across diverse languages presents an intriguing area for further research.

Conclusion

"Token Dropping for Efficient BERT Pretraining" introduces a pragmatic and effective strategy for reducing the computational overhead of BERT pretraining. By judiciously focusing computational resources on tokens identified as 'important', the approach achieves a compelling balance between efficiency and performance. This method not only marks a significant step forward in pretraining efficiency but also lays the groundwork for future advancements in building more sustainable and cost-effective natural language processing models.
