
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

(arXiv:2402.04347)
Published Feb 6, 2024 in cs.LG and cs.CL

Abstract

Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetuned-conversion" of task-specific Transformers into linear versions that recover task performance, and (3) "pretrained-conversion" of Transformers such as LLMs into linear versions finetunable on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we find prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights mimicking softmax attention. Experiments show Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions up to 6 perplexity points on WikiText-103 with causal GPTs, and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion. Converting a pretrained GPT-2 into a linear attention variant achieves state-of-the-art 16.7 perplexity on WikiText-103 for 125M subquadratic decoder models. We finally turn a pretrained Llama-2 7B into a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B achieves 28.1 higher ROUGE-1 points over the base standard attention model, where prior linear attentions lead to 16.5 point drops.

Hedgehog is an efficient, expressive linear attention that mimics standard softmax attention in Transformer training.

Overview

  • The paper introduces a new high-performance linear attention method named Hedgehog that mimics key features of softmax attention while maintaining linear computational complexity.

  • Empirical studies show that Hedgehog closely matches standard softmax attention and outperforms prior linear attentions across benchmarks and NLP tasks.

  • Hedgehog closes up to 68.6% of the performance gap to softmax attention on WikiText-103 and recovers over 99% of standard Transformer quality on the GLUE benchmark.

  • Validation in large-scale settings confirms Hedgehog's scalability and its ability to maintain high attention fidelity at longer sequence lengths and across different tasks.

Introduction

Linear attention mechanisms offer the exciting prospect of replacing traditional softmax attention, whose computational cost is quadratic in sequence length, with linear-complexity alternatives. Despite these efficiency benefits, previously proposed linear attentions often suffer substantially reduced model quality compared to their softmax counterparts.
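To make the complexity argument concrete, here is a minimal single-head, non-causal sketch (in PyTorch, not the paper's code) contrasting the quadratic softmax form with the kernelized form used by prior linear attentions: replacing exp(q·k) with a dot product of feature-mapped queries and keys lets the key-value product be computed once, so cost grows linearly with sequence length.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention materializes an (n x n) weight matrix: O(n^2 * d).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, feature_map):
    # Kernelized attention: exp(q . k) is replaced by phi(q) . phi(k), so
    # phi(K)^T V can be computed once and reused: O(n * d * d') overall.
    qp, kp = feature_map(q), feature_map(k)                  # (n, d')
    kv = kp.transpose(-2, -1) @ v                            # (d', d)
    z = qp @ kp.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (n, 1) normalizer
    return (qp @ kv) / (z + 1e-6)

# A fixed feature map from prior linear attentions (ELU + 1 keeps features positive).
elu_plus_one = lambda x: F.elu(x) + 1

n, d = 512, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out_quadratic = softmax_attention(q, k, v)              # (n, d)
out_linear = linear_attention(q, k, v, elu_plus_one)    # (n, d), linear in n
```

The causal (autoregressive) case adds running sums over the key-value products but keeps the same linear scaling.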

Bridging the Performance Gap

The paper identifies crucial properties of softmax attention that prior linear variants lack, namely low-entropy ("spiky") weight distributions and dot-product monotonicity. By using trainable single-layer MLPs (multi-layer perceptrons) as feature maps, the proposed method, dubbed Hedgehog, yields a linear attention whose weights closely mirror those of softmax attention, in particular their spikiness and monotonicity. Hedgehog preserves linear computational complexity while performing strongly across several regimes, including training from scratch and finetuning.
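A minimal sketch of such a learnable feature map follows, assuming a single trainable linear layer followed by an element-wise exponential; the class name, dimensions, and the concatenated negated copy are illustrative choices rather than the paper's exact parameterization. These maps take the place of the fixed feature map in a linear attention like the one sketched above.

```python
import torch
import torch.nn as nn

class SpikyFeatureMap(nn.Module):
    """Hedgehog-style learnable feature map (sketch): a trainable linear
    projection followed by an element-wise exponential. The exponential keeps
    the induced attention weights low-entropy ("spiky") and monotone in the
    query-key dot products, mimicking softmax."""

    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x):
        z = self.proj(x)
        # Concatenating exp(z) and exp(-z) retains information about both
        # signs of the projection while keeping all features positive.
        return torch.cat([torch.exp(z), torch.exp(-z)], dim=-1)

# Separate maps can be learned for queries and keys (per head and per layer).
phi_q, phi_k = SpikyFeatureMap(head_dim=64), SpikyFeatureMap(head_dim=64)
x = torch.randn(128, 64)
features = phi_q(x)   # (128, 128); used as the feature_map in a linear attention
```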

Empirical Validation

Extensive experiments validate Hedgehog, which consistently surpasses prior linear attention formulations. In training-from-scratch settings, it performs strongly on standard benchmarks such as the Long Range Arena (LRA) tasks and WikiText-103 language modeling, closing 68.6% of the performance gap to softmax attention on the latter. In the train-from-scratch and finetuned-conversion settings, Hedgehog recovers over 99% of the original standard Transformer quality, outpacing prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs and up to 8.7 GLUE score points with finetuned bidirectional BERTs.
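The conversion settings rest on training the feature maps to reproduce the attention weights of the original softmax layers, i.e. the "softmax mimicry" the title refers to. The sketch below shows one plausible form of that attention-distillation objective, a soft cross-entropy between softmax weights and the normalized linear-attention weights; the function name is ours, and the paper's exact loss, masking, and training schedule are not reproduced here.

```python
import torch

def attention_distillation_loss(q, k, phi_q, phi_k, eps=1e-6):
    """Soft cross-entropy between true softmax attention weights and the
    weights induced by learnable feature maps. The (n x n) matrices are only
    needed during this training step; inference uses the linear form.
    (Hypothetical sketch; the paper's exact loss and masking may differ.)"""
    d = q.shape[-1]
    target = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (n, n)

    scores = phi_q(q) @ phi_k(k).transpose(-2, -1)                       # non-negative
    pred = scores / (scores.sum(dim=-1, keepdim=True) + eps)             # row-normalized

    return -(target * torch.log(pred + eps)).sum(dim=-1).mean()

# Usage with any non-negative feature map; in practice phi_q/phi_k are trainable
# modules whose parameters receive gradients via loss.backward().
n, d = 64, 32
q, k = torch.randn(n, d), torch.randn(n, d)
phi = lambda x: torch.exp(x)   # stand-in for a trainable Hedgehog-style map
loss = attention_distillation_loss(q, k, phi, phi)
```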

Contributions and Scalability

The results make a compelling case for the practicality and scalability of linear attentions in Transformers, including state-of-the-art WikiText-103 perplexity among 125M-parameter subquadratic decoder models after converting a pretrained GPT-2, and significant gains on the SAMSum summarization task with a scaled-up pretrained Llama-2 7B converted using low-rank adaptation. Notably, Hedgehog's attention remains faithful at longer sequence lengths and transfers effectively to new tasks, evidencing its adaptability and generalization. The findings suggest that by effectively mimicking softmax attention, linear attention can achieve near-equivalent performance with linear complexity, a blend of efficiency and expressivity that prior linear attentions did not reach.
