
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

(arXiv:2402.04347)
Published Feb 6, 2024 in cs.LG and cs.CL

Abstract

Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetuned-conversion" of task-specific Transformers into linear versions that recover task performance, and (3) "pretrained-conversion" of Transformers such as LLMs into linear versions finetunable on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we find prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights mimicking softmax attention. Experiments show Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions up to 6 perplexity points on WikiText-103 with causal GPTs, and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion. Converting a pretrained GPT-2 into a linear attention variant achieves state-of-the-art 16.7 perplexity on WikiText-103 for 125M subquadratic decoder models. We finally turn a pretrained Llama-2 7B into a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B achieves 28.1 higher ROUGE-1 points over the base standard attention model, where prior linear attentions lead to 16.5 point drops.

Hedgehog is an efficient, expressive linear attention that mimics standard softmax attention in Transformer training.

Overview

  • The paper introduces a new high-performance linear attention method named Hedgehog that mimics key features of softmax attention while maintaining linear computational complexity.

  • Empirical studies show that Hedgehog closely matches standard softmax attention and outperforms prior linear attentions across benchmarks and NLP tasks.

  • Hedgehog closes up to 68.6% of the performance gap to softmax attention on WikiText-103 and recovers over 99% of standard Transformer quality on the GLUE benchmark.

  • Validation in large-scale settings confirms Hedgehog's scalability and its ability to maintain high attention fidelity at longer sequence lengths and across different tasks.

Introduction

Linear attention mechanisms offer the exciting prospect of replacing traditional softmax attention, whose computational cost is quadratic in sequence length, with linear-complexity alternatives. Despite these efficiency benefits, previously proposed linear attentions often suffer substantially reduced model quality compared to their softmax counterparts.
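To make the complexity argument concrete, here is a minimal single-head, non-causal sketch (in PyTorch, not the paper's code) contrasting the quadratic softmax form with the kernelized form used by prior linear attentions: replacing exp(q·k) with a dot product of feature-mapped queries and keys lets the key-value product be computed once, so cost grows linearly with sequence length.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention materializes an (n x n) weight matrix: O(n^2 * d).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, feature_map):
    # Kernelized attention: exp(q . k) is replaced by phi(q) . phi(k), so
    # phi(K)^T V can be computed once and reused: O(n * d * d') overall.
    qp, kp = feature_map(q), feature_map(k)                  # (n, d')
    kv = kp.transpose(-2, -1) @ v                            # (d', d)
    z = qp @ kp.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (n, 1) normalizer
    return (qp @ kv) / (z + 1e-6)

# A fixed feature map from prior linear attentions (ELU + 1 keeps features positive).
elu_plus_one = lambda x: F.elu(x) + 1

n, d = 512, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out_quadratic = softmax_attention(q, k, v)              # (n, d)
out_linear = linear_attention(q, k, v, elu_plus_one)    # (n, d), linear in n
```

The causal (autoregressive) case adds running sums over the key-value products but keeps the same linear scaling.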

Bridging the Performance Gap

The paper identifies crucial properties of softmax attention that prior linear variants lack, namely low-entropy ("spiky") weight distributions and dot-product monotonicity. By using trainable single-layer MLPs (multi-layer perceptrons) as feature maps, the proposed method, dubbed Hedgehog, yields a linear attention whose weights closely mirror those of softmax attention, in particular their spikiness and monotonicity. Hedgehog preserves linear computational complexity while performing strongly across several regimes, including training from scratch and finetuning.
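A minimal sketch of such a learnable feature map follows, assuming a single trainable linear layer followed by an element-wise exponential; the class name, dimensions, and the concatenated negated copy are illustrative choices rather than the paper's exact parameterization. These maps take the place of the fixed feature map in a linear attention like the one sketched above.

```python
import torch
import torch.nn as nn

class SpikyFeatureMap(nn.Module):
    """Hedgehog-style learnable feature map (sketch): a trainable linear
    projection followed by an element-wise exponential. The exponential keeps
    the induced attention weights low-entropy ("spiky") and monotone in the
    query-key dot products, mimicking softmax."""

    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x):
        z = self.proj(x)
        # Concatenating exp(z) and exp(-z) retains information about both
        # signs of the projection while keeping all features positive.
        return torch.cat([torch.exp(z), torch.exp(-z)], dim=-1)

# Separate maps can be learned for queries and keys (per head and per layer).
phi_q, phi_k = SpikyFeatureMap(head_dim=64), SpikyFeatureMap(head_dim=64)
x = torch.randn(128, 64)
features = phi_q(x)   # (128, 128); used as the feature_map in a linear attention
```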

Empirical Validation

Extensive experiments validate Hedgehog, which consistently surpasses prior linear attention formulations. In training-from-scratch settings, it performs strongly on standard benchmarks such as the Long Range Arena (LRA) tasks and WikiText-103 language modeling, closing 68.6% of the performance gap to softmax attention on the latter. In the train-from-scratch and finetuned-conversion settings, Hedgehog recovers over 99% of the original standard Transformer quality, outpacing prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs and up to 8.7 GLUE score points with finetuned bidirectional BERTs.
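The conversion settings rest on training the feature maps to reproduce the attention weights of the original softmax layers, i.e. the "softmax mimicry" the title refers to. The sketch below shows one plausible form of that attention-distillation objective, a soft cross-entropy between softmax weights and the normalized linear-attention weights; the function name is ours, and the paper's exact loss, masking, and training schedule are not reproduced here.

```python
import torch

def attention_distillation_loss(q, k, phi_q, phi_k, eps=1e-6):
    """Soft cross-entropy between true softmax attention weights and the
    weights induced by learnable feature maps. The (n x n) matrices are only
    needed during this training step; inference uses the linear form.
    (Hypothetical sketch; the paper's exact loss and masking may differ.)"""
    d = q.shape[-1]
    target = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (n, n)

    scores = phi_q(q) @ phi_k(k).transpose(-2, -1)                       # non-negative
    pred = scores / (scores.sum(dim=-1, keepdim=True) + eps)             # row-normalized

    return -(target * torch.log(pred + eps)).sum(dim=-1).mean()

# Usage with any non-negative feature map; in practice phi_q/phi_k are trainable
# modules whose parameters receive gradients via loss.backward().
n, d = 64, 32
q, k = torch.randn(n, d), torch.randn(n, d)
phi = lambda x: torch.exp(x)   # stand-in for a trainable Hedgehog-style map
loss = attention_distillation_loss(q, k, phi, phi)
```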

Contributions and Scalability

The results make a compelling case for the practicality and scalability of linear attentions in Transformers, including state-of-the-art WikiText-103 perplexity among 125M-parameter subquadratic decoder models after converting a pretrained GPT-2, and significant gains on the SAMSum summarization task with a scaled-up pretrained Llama-2 7B converted using low-rank adaptation. Notably, Hedgehog's attention remains faithful at longer sequence lengths and transfers effectively to new tasks, evidencing its adaptability and generalization. The findings suggest that by effectively mimicking softmax attention, linear attention can achieve near-equivalent performance with linear complexity, a blend of efficiency and expressivity that prior linear attentions did not reach.
