
Latent Attention for Linear Time Transformers

(2402.17512)
Published Feb 27, 2024 in cs.CL and stat.ML

Abstract

The time complexity of the standard attention mechanism in a transformer scales quadratically with the length of the sequence. We introduce a method to reduce this to linear scaling with time, based on defining attention via latent vectors. The method is readily usable as a drop-in replacement for the standard attention mechanism. Our "Latte Transformer" model can be implemented for both bidirectional and unidirectional tasks, with the causal version allowing a recurrent implementation which is memory and time-efficient during inference of language generation tasks. Whilst next token prediction scales linearly with the sequence length for a standard transformer, a Latte Transformer requires constant time to compute the next token. The empirical performance of our method is comparable to standard attention, yet allows scaling to context windows much larger than practical in standard attention.

Figure: runtime comparison of causal and bidirectional Latte against standard causal attention, for a fixed batch size, model dimension, and number of heads.

Overview

  • The paper introduces a novel attention mechanism called Latte (Latent Attention), aiming to reduce the computational complexity in transformer models by achieving linear scaling with sequence length.

  • Latte uses latent vectors to facilitate attention processes, enabling efficient handling of both bidirectional and unidirectional (especially causal for language generation) tasks.

  • It maintains linear computational complexity in both time and memory, in contrast to the quadratic cost of standard attention, without significantly compromising performance.

  • Empirical evaluations show Latte's competitive performance on long sequence tasks and language generation, suggesting its potential for real-world NLP applications and future research directions.

Latent Attention for Efficient Transformer Models

Overview of Latent Attention Mechanism

The paper introduces a novel attention mechanism, termed Latte (Latent Attention), designed to significantly reduce the computational complexity of the standard attention mechanism used in transformer models. The central challenge it addresses is the quadratic scaling of time and space complexity with sequence length in traditional attention, which limits the practical application of transformers to long sequences. Latte achieves linear scaling with sequence length by introducing latent vectors that mediate the attention process. This allows for both bidirectional and unidirectional applications, with the causal variant being especially suited to language generation tasks due to its efficient inference.

Latte Attention Explained

Latte redefines the attention mechanism by comparing sequence elements (tokens) with a fixed set of learned latent tokens, instead of performing all pairwise comparisons between tokens in the sequence. This adjustment not only diminishes computational and memory requirements but also preserves the intuitive understanding of attention—focusing on different parts of the input based on similarity to concepts represented by the latent tokens.

For the non-causal (bidirectional) variant, the mechanism projects input tokens into a latent space where the attention weights are computed as a mixture over latent states, effectively summarizing the input sequence's information. The causal version, crucial for tasks such as language generation, operates similarly but respects the ordering of the sequence, ensuring that only past information is used at each step.
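To make the bidirectional variant concrete, here is a minimal single-head, unbatched NumPy sketch of latent attention as described above: each token is scored against L learned latent tokens rather than against every other token, so the cost is O(T·L·D) instead of O(T²·D). The weight names (Wq, Wk, Wv) and shapes are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_latte(X, Wq, Wk, Wv):
    """Bidirectional latent attention over a sequence X of shape (T, D).

    Wq, Wk : (D, L) score each token against the L latent tokens.
    Wv     : (D, D) value projection.
    Cost is O(T * L * D) rather than the O(T^2 * D) of pairwise attention.
    """
    A = softmax(X @ Wq, axis=-1)  # (T, L): p(latent | position), rows sum to 1
    B = softmax(X @ Wk, axis=0)   # (T, L): p(position | latent), columns sum to 1
    V = X @ Wv                    # (T, D) values
    latent_summary = B.T @ V      # (L, D): one value summary per latent token
    return A @ latent_summary     # (T, D): each position mixes the latent summaries

# Illustrative usage with arbitrary sizes
rng = np.random.default_rng(0)
T, D, L = 1024, 64, 16
X = rng.normal(size=(T, D))
Wq = rng.normal(size=(D, L)) / np.sqrt(D)
Wk = rng.normal(size=(D, L)) / np.sqrt(D)
Wv = rng.normal(size=(D, D)) / np.sqrt(D)
Y = bidirectional_latte(X, Wq, Wk, Wv)  # shape (1024, 64)
```

Because the sequence only ever interacts with the L latent summaries, the quadratic token-token attention matrix is never materialized.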

Computational Complexity

One of the central contributions of Latte is that it maintains linear computational complexity in both time and space. For bidirectional tasks, this efficiency enables handling significantly longer sequences than conventional attention mechanisms permit. The paper contrasts the complexity of Latte with that of standard attention, demonstrating its superior efficiency without a substantial loss in performance.
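As a rough back-of-the-envelope illustration of the gap (the values of sequence length T, number of latents L, and model dimension D below are assumed for illustration, not settings from the paper):

```python
# Illustrative operation counts; T, L, D are assumed values, not the paper's settings.
T, L, D = 16_384, 128, 512
standard_ops = T * T * D  # all pairwise token-token comparisons
latte_ops = T * L * D     # token-latent comparisons only
print(f"standard: {standard_ops:.1e} ops, Latte: {latte_ops:.1e} ops, "
      f"ratio: {standard_ops / latte_ops:.0f}x")
```

The ratio is simply T / L, so the savings grow in direct proportion to the context length.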

In the causal setting, Latte's design allows the attention weights and the transformed latent representations to be computed recurrently, contributing to its efficiency and scalability in generative tasks. Notably, each new token can be generated in constant time from a compact state summarizing past representations, in contrast to standard attention, whose per-token cost grows linearly with the context; this makes long-context, real-time language generation more practical.
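Below is a minimal sketch of one constant-time-per-token recurrent step (single head, unbatched). The state layout and the running-max stabilization are assumptions made for illustration, not necessarily the paper's exact recursion.

```python
import numpy as np

def causal_latte_init(L, D):
    """Empty recurrent state: per-latent running max, exp-sum, and weighted value sum."""
    return np.full(L, -np.inf), np.zeros(L), np.zeros((L, D))

def causal_latte_step(q_t, k_t, v_t, state):
    """One decoding step of recurrent causal latent attention.

    q_t, k_t : (L,) scores of the current token against the L latent tokens
    v_t      : (D,) value vector of the current token

    The state has fixed size, so each step costs O(L * D) time and memory,
    independent of how many tokens have already been processed.
    """
    m, z, num = state
    m_new = np.maximum(m, k_t)                  # running max per latent (stability)
    rescale = np.exp(m - m_new)                 # rescale the old accumulators
    w = np.exp(k_t - m_new)                     # weight of the new token per latent
    z = z * rescale + w                         # (L,)   normalizer per latent
    num = num * rescale[:, None] + w[:, None] * v_t[None, :]  # (L, D) weighted values
    latent_values = num / z[:, None]            # (L, D) normalized summary per latent
    p_latent = np.exp(q_t - q_t.max())
    p_latent /= p_latent.sum()                  # p(latent | current token)
    y_t = p_latent @ latent_values              # (D,) output for the current token
    return y_t, (m_new, z, num)
```

Since the state consists only of (L,) and (L, D) arrays, memory during generation stays constant no matter how long the context grows.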

Experimental Evaluation

The paper reports empirical evaluations of Latte on a suite of tasks that test both bidirectional and unidirectional capabilities. For bidirectional tasks, it uses the Long Range Arena (LRA) benchmark, showing competitive or superior performance compared to both the standard transformer and other efficient transformer variants. For language generation, Latte is evaluated on the OpenWebText and Enwik8 datasets, showing performance comparable to standard transformers while significantly reducing the computational burden.

Implications and Future Directions

The introduction of Latte offers a promising direction for designing efficient transformer models, both for theoretical exploration and for practical applications. Its ability to substantially reduce computational requirements without sacrificing performance paves the way for deploying more capable NLP models in resource-constrained environments. Future work could explore applying Latte to a broader range of tasks, including those outside NLP, and further optimizing the latent attention mechanism for performance and efficiency.
