
SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for Accelerating BERT Inference (2303.09266v2)

Published 16 Mar 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Dynamic early exiting has been proven to improve the inference speed of pre-trained language models like BERT. However, all samples must pass through all consecutive layers before early exiting, and more complex samples usually pass through more layers, so redundant computation remains. In this paper, we propose SmartBERT, a novel dynamic early exiting mechanism combined with layer skipping for BERT inference, which adds a skipping gate and an exiting operator to each layer of BERT. SmartBERT can adaptively skip some layers and adaptively choose whether to exit. In addition, we propose cross-layer contrastive learning and incorporate it into our training phases to boost the intermediate layers and classifiers, which benefits early exiting. To keep the usage of skipping gates consistent between the training and inference phases, we propose a hard weight mechanism during the training phase. We conduct experiments on eight classification datasets from the GLUE benchmark. Experimental results show that SmartBERT achieves a 2-3x reduction in computation with minimal accuracy drops compared with BERT, and our method outperforms previous methods in both efficiency and accuracy. Moreover, on some complex datasets like RTE and WNLI, we show that entropy-based early exiting hardly works and that the skipping mechanism is essential for reducing computation.

Citations (7)

Summary

  • The paper introduces SmartBERT, which reduces computational costs by dynamically skipping and exiting layers, achieving a 2-3x reduction in FLOPs.
  • It employs a skipping gate and early exiting classifier in each BERT layer to bypass redundant computations based on the complexity of the input.
  • The study utilizes cross-layer contrastive learning and a hard weight mechanism to ensure consistency and maintain accuracy across various NLP tasks on the GLUE benchmark.

SmartBERT: Dynamic Early Exiting and Layer Skipping for BERT Inference Acceleration

Introduction

The paper introduces SmartBERT, an enhancement of the BERT model that incorporates both a dynamic early exiting mechanism and a novel layer skipping strategy to accelerate inference. The authors present SmartBERT as a way to reduce the computational load inherent in BERT and similar large-scale pre-trained language models (PLMs) by adapting the inference process to the complexity of each data sample. By strategically skipping layers and exiting early, SmartBERT achieves notable reductions in computation while maintaining robust performance.

Methodology

The proposed SmartBERT architecture revolves around integrating a skipping gate and an exiting operator into each layer of the BERT model. These components collectively enable the model to dynamically skip unnecessary layers and exit early when decision confidence is adequately high, thus decreasing computational redundancy.
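To make the control flow concrete, here is a minimal sketch of such an inference loop. The function names, the 0.5 gate cutoff, and the entropy threshold value are illustrative assumptions, not the authors' implementation; the paper's exit criterion is entropy-based, which the sketch mirrors.

```python
import numpy as np

def entropy(probs):
    # Shannon entropy of a probability vector; low entropy means a confident prediction
    return -np.sum(probs * np.log(probs + 1e-12))

def smartbert_inference(hidden, layers, gates, classifiers, exit_threshold=0.3):
    """Illustrative SmartBERT-style inference loop (a sketch, not the authors' code).

    layers:      per-layer transform functions h -> h'
    gates:       per-layer skipping gates h -> skip probability in [0, 1]
    classifiers: per-layer early-exit classifiers h -> class probabilities
    """
    probs = None
    for layer, gate, clf in zip(layers, gates, classifiers):
        if gate(hidden) > 0.5:        # skipping gate: bypass this layer entirely
            continue
        hidden = layer(hidden)        # run the transformer layer
        probs = clf(hidden)           # early-exit classifier on the layer output
        if entropy(probs) < exit_threshold:
            return probs, True        # confident enough: terminate early
    if probs is None:                 # degenerate case: every layer was skipped
        probs = classifiers[-1](hidden)
    return probs, False               # fell through all layers
```

Note that skipping and exiting compose: a sample can bypass several layers and still exit before the final one, which is where the claimed FLOPs savings come from.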

  1. Layer Skipping and Early Exiting: Each BERT layer in SmartBERT has a corresponding skipping gate and an early exiting classifier. The skipping gate uses a learned function to determine if the current layer's computations can be bypassed, while the early exiting classifier assesses whether the output at any layer is conclusive enough to terminate further processing.
  2. Training with Cross-Layer Contrastive Learning: During training, SmartBERT employs cross-layer contrastive learning, a technique designed to enhance the coherence and discriminative power of intermediate representations. This is achieved by maximizing the similarity between representations of the same input across consecutive layers, ensuring that meaningful information is preserved even in earlier exits.
  3. Consistency via Hard Weight Mechanism: To address the disparities between training and inference phases in handling gate outputs, a hard weight mechanism is adopted. This mechanism ensures that decisions made by the skipping gates are consistent and reliable during both training and real-world inference scenarios.
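The cross-layer contrastive objective in step 2 can be sketched as an InfoNCE-style loss, where the same sample's representations at consecutive layers form the positive pair and other samples in the batch serve as negatives. This is our reading of the summary, not the paper's exact formulation; the temperature value is an illustrative assumption.

```python
import numpy as np

def cross_layer_contrastive_loss(reps_l, reps_l1, temperature=0.1):
    """InfoNCE-style cross-layer contrastive loss (illustrative sketch).

    reps_l, reps_l1: (batch, dim) representations of the same batch at two
    consecutive layers. Diagonal entries of the similarity matrix are the
    positive pairs; off-diagonal entries are in-batch negatives.
    """
    # L2-normalise so dot products are cosine similarities
    a = reps_l / np.linalg.norm(reps_l, axis=1, keepdims=True)
    b = reps_l1 / np.linalg.norm(reps_l1, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal
```

Minimizing this loss pushes each layer's representation of a sample toward the next layer's representation of the same sample, which is what makes intermediate classifiers viable exit points.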

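The train/inference discrepancy that the hard weight mechanism in step 3 removes can be shown in a few lines. Soft weighting blends the skip and compute branches by the gate probability, whereas the hard variant commits to one branch, matching what actually happens at inference. This is a behavioral sketch only; how gradients flow through the binarized gate during training (e.g. a straight-through-style estimator) is our assumption and is not detailed in the summary.

```python
import numpy as np

def soft_skip(h, layer_out, g):
    # Soft weighting: blends both branches during training, a mixture
    # that never occurs at inference time
    return g * h + (1 - g) * layer_out

def hard_skip(h, layer_out, g):
    # Hard weight mechanism: commit to a single branch in the forward pass,
    # so training sees exactly the skip/compute behaviour used at inference
    return h if g > 0.5 else layer_out
```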
Experimental Results

Experiments were conducted on eight datasets from the GLUE benchmark, demonstrating SmartBERT's computational efficiency and accuracy retention:

  • Efficiency: SmartBERT achieves a 2-3x reduction in FLOPs compared to baseline BERT, providing a significant speedup in inference time.
  • Accuracy: The model maintains comparable accuracy to BERT on various NLP tasks, occasionally surpassing it under the same computational constraints.
  • Complex datasets: On datasets like RTE and WNLI, traditional entropy-based early exiting faltered due to sample complexity, showcasing the added value of SmartBERT's layer skipping.

Implications and Future Work

SmartBERT exemplifies an effective methodology for mitigating computational costs associated with transformer models without compromising performance. The proposed techniques, specifically layer skipping, introduce a new dimension of flexibility and adaptability in neural network inference strategies.

The exploration of SmartBERT could extend to other PLMs, adapting its principles for model-specific constraints and tasks. Additionally, further integration with other model compression techniques such as quantization or pruning could amplify its efficiency gains.

Conclusion

SmartBERT successfully combines dynamic early exiting with layer skipping to optimize BERT inference, achieving noticeable reductions in computational cost while retaining model effectiveness. This dual approach sets a new precedent for efficient model deployment in resource-constrained environments and offers a blueprint for future innovations in adaptive inference models. Implementing SmartBERT can significantly benefit scenarios demanding rapid or real-time information processing without sacrificing the depth of analysis traditionally provided by models like BERT.
