Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search

Published 14 Oct 2020 in cs.CL and cs.LG | (2010.07003v2)

Abstract: Despite transformers' impressive accuracy, their computational cost is often prohibitive to use with limited computational resources. Most previous approaches to improve inference efficiency require a separate model for each possible computational budget. In this paper, we extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training. We train a transformer with LengthDrop, a structural variant of dropout, which stochastically determines a sequence length at each layer. We then conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget. Additionally, we significantly extend the applicability of PoWER-BERT beyond sequence-level classification into token-level classification with Drop-and-Restore process that drops word-vectors temporarily in intermediate layers and restores at the last layer if necessary. We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups, including span-based question answering and text classification. Code is available at https://github.com/clovaai/length-adaptive-transformer.

Abstract PDF Upgrade to Chat

Authors (2)

Citations (81)

View on Semantic Scholar

Summary

The paper introduces LengthDrop, a training technique that enables robustness across varied sequence lengths during inference.
It proposes a Drop-and-Restore process to adapt transformers for both token and sequence tasks without extra fine-tuning.
An evolutionary search strategy identifies the best length configurations, achieving significant FLOP savings while preserving accuracy.

Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search

The article proposes an innovative method for improving the computational efficiency of transformer models, particularly in resource-constrained environments. The introduced Length-Adaptive Transformer framework achieves flexible inference by dynamically adapting the sequence length at each layer, thus maintaining high accuracy while minimizing computational costs. This paper extends the PoWER-BERT approach and introduces key innovations such as LengthDrop and Drop-and-Restore schemes combined with a multi-objective evolutionary search strategy.

Key Contributions

LengthDrop: This is a novel training technique used to develop a transformer model robust to various sequence lengths during inference. Inspired by structured dropout concepts, LengthDrop stochastically samples the sequence length at each layer during training, thus enabling the model to perform efficiently under different computational budgets without needing re-training or fine-tuning.
Drop-and-Restore Process: This process extends the capabilities of PoWER-BERT beyond sequence-level tasks to token-level tasks. By temporarily dropping word vectors in intermediate layers and restoring them at the final layer, Drop-and-Restore aids in maintaining task-specific layer requirements while still attaining computational savings.
Evolutionary Search: Once the model is trained with LengthDrop, a multi-objective evolutionary algorithm searches for the length configuration that optimizes the trade-off between accuracy and computational efficiency given fixed resource constraints. This technique populates a Pareto frontier, effectively presenting the best configurations for any given scenario.

Numerical Results and Implications

Empirical results underscore the effectiveness of the proposed approach across diverse NLP tasks, including span-based question answering and text classification. For instance, applying the Length-Adaptive Transformer to SQuAD 1.1 using BERT variants showcased superior accuracy-efficiency trade-offs, granting up to 3x efficiency improvements in terms of FLOPs. Additionally, the novel framework allows for more than half-the FLOP savings on MNLI-m and SST-2 benchmarks while maintaining or even slightly improving accuracy over standard, non-length adaptive approaches.

Practical and Theoretical Implications

The research signifies a substantial advancement in the transformer efficiency landscape, particularly in scenarios where varying computational resources might impose differing inference constraints. With the ability to dynamically adjust sequence lengths, this model is particularly suitable for deployment in edge computing and online services where computational resources are limited. Theoretically, LengthDrop and Drop-and-Restore introduce valuable insights and frameworks that could inspire future studies on adaptive architectures, potentially influencing model efficiency across diverse ML domains.

Future Directions

While the presented approach addresses the efficiency bottleneck associated with transformers, future research could explore integrating other scalable dimensions such as adaptive attention heads or parallel architectures. Additionally, testing on broader tasks and datasets could further validate the approach's general applicability. Investigating the combination of Length-Adaptive Transformers with efficient hardware-specific optimizations could also yield fruitful outcomes, narrowing the gap between theoretical efficiency and real-world deployment scenarios.

The paper effectively pushes the envelope on transformer adaptability and efficiency, presenting a comprehensive framework with tangible benefits across varied inference landscapes. As latency considerations become increasingly paramount, methods like this stand as vital contributions to the ongoing evolution of scalable and efficient NLP technologies.

Markdown Report Issue