Linformer: Self-Attention with Linear Complexity

Published 8 Jun 2020 in cs.LG and stat.ML | (2006.04768v3)

Abstract: Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.

Summary

The paper presents a novel low-rank approximation method that reduces self-attention complexity from O(n²) to O(n).
It leverages linear projections in key and value computations to simplify the transformer architecture without sacrificing performance.
Empirical results show up to 20x faster processing and lower memory usage, making transformers more viable for resource-constrained applications.

Linformer: Self-Attention with Linear Complexity

Large transformer models have driven advances across NLP, producing state-of-the-art results in machine translation, text classification, and question answering, among other tasks. However, the resources needed to train and deploy these models pose substantial practical challenges. The paper "Linformer: Self-Attention with Linear Complexity" addresses this cost by approximating the self-attention mechanism with a low-rank matrix, reducing its complexity from $O(n^2)$ to $O(n)$.

Introduction and Motivation

Transformer models, which are built on Multi-Head Self-Attention (MHA), handle long-term dependencies within sequences well, giving them an edge over recurrent models on many NLP tasks. Despite this success, transformers face a critical bottleneck: the self-attention operation takes $O(n^2)$ time and space. This quadratic dependence on the sequence length $n$ inflates computational cost and makes deployment resource-intensive. The paper asks whether this quadratic complexity can be reduced without compromising performance.
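To make the quadratic term concrete, here is a minimal NumPy sketch of one standard scaled dot-product attention head. The function names and sizes are illustrative assumptions, not taken from the paper; the point is only that the intermediate matrix `P` has shape `(n, n)`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention_head(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V have shape (n, d)."""
    d = Q.shape[-1]
    P = softmax(Q @ K.T / np.sqrt(d))  # (n, n): the quadratic-size object
    return P @ V, P

rng = np.random.default_rng(0)
n, d = 512, 64  # illustrative sizes (assumption)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, P = standard_attention_head(Q, K, V)
print(P.shape, out.shape)  # (512, 512) (512, 64)
```

Doubling `n` quadruples the number of entries in `P`, which is the cost the rest of the paper targets.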

Several earlier approaches try to improve transformer efficiency. Sparse attention models such as the Sparse Transformer restrict each token to a subset of positions, reducing complexity to $O(n \sqrt{n})$; Longformer follows a related sparse-attention design. The Reformer model uses locality-sensitive hashing (LSH) to bring complexity down to $O(n \log(n))$. While promising, these approaches either yield limited efficiency gains or, in the Reformer's case, add computational overhead from sequential hashing operations.

The Linformer model departs from these approaches by exploiting the low-rank property of the self-attention mechanism. The core insight is that the stochastic matrix formed by the self-attention mechanism is inherently low-rank. This observation allows for the simplification of the self-attention mechanism using low-rank approximations, leading to linear time and space complexity.

Theoretical and Empirical Findings

Through a combination of theoretical analysis and empirical validation, the paper demonstrates that the context mapping matrix $P$ in the self-attention mechanism can be effectively approximated by a low-rank matrix. The low-rank nature of $P$ is supported by spectrum analysis, which shows that most of the information in $P$ is captured by its few largest singular values.
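The kind of spectrum analysis described above can be sketched as follows: compute the singular values of an attention matrix and look at their normalized cumulative sum. This is only an illustration under loud assumptions: `Q` and `K` here are random Gaussian matrices rather than activations from a pretrained model, so the numbers it prints say nothing about the paper's measurements.

```python
import numpy as np

def cumulative_spectrum(P):
    """Normalized cumulative sum of the singular values of P."""
    s = np.linalg.svd(P, compute_uv=False)  # singular values, descending
    return np.cumsum(s) / s.sum()

rng = np.random.default_rng(0)
n, d = 256, 64  # illustrative sizes (assumption)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Row-wise softmax of the scaled scores gives a row-stochastic matrix P.
scores = Q @ K.T / np.sqrt(d)
scores -= scores.max(axis=1, keepdims=True)
P = np.exp(scores)
P /= P.sum(axis=1, keepdims=True)

cum = cumulative_spectrum(P)
k = 64
print(f"share of spectrum in top {k} of {n} singular values: {cum[k - 1]:.3f}")
```

A curve that approaches 1 after only a small fraction of the singular values indicates that a low-rank matrix captures most of `P`.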

Theoretical Analysis: The authors ground the low-rank approximation of self-attention in the Johnson-Lindenstrauss lemma. The proofs establish that, for suitable choices of the projection matrices $E_i$ and $F_i$, self-attention can be approximated in $O(n)$ time and space without significant loss of information.
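For reference, the Johnson-Lindenstrauss lemma in its standard textbook form (stated from general knowledge, not quoted from the paper; the symbols $X$, $m$, $N$, and $f$ are introduced here for illustration): for any $0 < \epsilon < 1$ and any set $X$ of $m$ points in $\mathbb{R}^N$, there is a linear map $f: \mathbb{R}^N \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log m)$ such that

$$(1-\epsilon)\,\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\epsilon)\,\|u-v\|^2 \quad \text{for all } u, v \in X.$$

Because the target dimension grows only logarithmically in the number of points, this is roughly the intuition for why a projected dimension much smaller than the sequence length can preserve the geometry that attention depends on.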

Model and Implementation

The Linformer adds linear projections $E_i$ and $F_i$ to the computation of the key and value layers in self-attention:

$$\overline{\text{head}_i} = \text{Attention}(Q W_i^Q, E_i KW_i^K, F_i VW_i^V) = \text{softmax}\left(\frac{Q W_i^Q (E_i KW_i^K)^T}{\sqrt{d_k}}\right) F_i VW_i^V$$

Because $E_i$ and $F_i$ project the length-$n$ key and value sequences down to a shorter length $k$, the context mapping matrix $\bar{P}$ is $n \times k$ rather than $n \times n$. For a fixed $k$, the dot-product attention therefore costs $O(nk)$, which is linear in $n$.
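The formula above can be sketched for a single head in NumPy. This is a hedged illustration, not the authors' implementation: the function name `linformer_head` is made up, the projections are written as `(k, n)` matrices applied along the sequence axis (with `k` the projected length), and `E` and `F` are random here, whereas in the model they would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_head(Q, K, V, E, F):
    """One Linformer-style head (illustrative sketch).

    Q, K, V: (n, d) inputs already multiplied by W^Q, W^K, W^V.
    E, F:    (k, n) projections applied along the sequence axis.
    """
    d = Q.shape[-1]
    K_proj = E @ K                               # (k, d)
    V_proj = F @ V                               # (k, d)
    P_bar = softmax(Q @ K_proj.T / np.sqrt(d))   # (n, k) instead of (n, n)
    return P_bar @ V_proj, P_bar                 # (n, d), (n, k)

rng = np.random.default_rng(0)
n, d, k = 1024, 64, 128  # illustrative sizes (assumption)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Assumption: random Gaussian projections stand in for learned parameters.
E = rng.standard_normal((k, n)) / np.sqrt(k)
F = rng.standard_normal((k, n)) / np.sqrt(k)

out, P_bar = linformer_head(Q, K, V, E, F)
print(out.shape, P_bar.shape)  # (1024, 64) (1024, 128)
```

The output keeps the shape `(n, d)` of a standard head, so the projected head can replace it without changing the surrounding layers; only the intermediate attention map shrinks.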

The Linformer achieves similar or even slightly better performance on downstream tasks than standard transformers, while substantially reducing both training and inference time: it runs up to 20 times faster and requires significantly less memory.
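A back-of-the-envelope count shows where the savings come from. The sketch below compares the number of entries in one head's attention map with and without projecting the key sequence to a fixed length `k`. The helper name and the value `k = 256` are hypothetical, and entry counts are not wall-clock measurements, so the ratios are not the paper's reported speedups.

```python
def attention_map_entries(n, k=None):
    """Entries in one head's attention map: n*n normally, n*k with projected keys."""
    return n * n if k is None else n * k

k = 256  # hypothetical projected length, held fixed as n grows
for n in (512, 4096, 65536):
    full = attention_map_entries(n)
    low = attention_map_entries(n, k)
    print(f"n={n:6d}  n*n={full:>13,d}  n*k={low:>11,d}  ratio={full // low}x")
```

With `k` fixed, the ratio is simply `n / k`, so the advantage grows linearly with sequence length.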

Experimental Results

The empirical validation involves pretraining the Linformer on the BookCorpus and English Wikipedia using the masked-language-modeling objective. Subsequently, models are fine-tuned on several benchmark tasks from GLUE and sentiment analysis on IMDB reviews. The results illustrate comparable performance with significant speed and memory improvements over the standard transformer models.

Efficiency Impact: The authors show that the Linformer sustains its performance even for longer sequence lengths, empirically supporting the claim of linear complexity. Furthermore, parameter sharing strategies between projections are evaluated, reducing memory footprint without degrading model performance.
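The effect of parameter sharing on the number of projection matrices can be sketched with simple counting. The three strategies the paper describes are headwise sharing (one $E$ and one $F$ per layer), key-value sharing (a single matrix per layer used for both), and layerwise sharing (one matrix for the whole model). The function below and its `"none"` baseline are illustrative assumptions.

```python
def distinct_projections(layers, heads, sharing="none"):
    """Count distinct projection matrices under a given sharing strategy."""
    if sharing == "layerwise":
        return 1  # one matrix shared by all layers, heads, keys, and values
    per_layer = {
        "none": 2 * heads,  # separate E_i and F_i for every head
        "headwise": 2,      # one E and one F shared across heads
        "key-value": 1,     # one matrix shared across heads and used for both K and V
    }
    return layers * per_layer[sharing]

for mode in ("none", "headwise", "key-value", "layerwise"):
    print(mode, distinct_projections(12, 12, mode))
```

For a 12-layer, 12-head model this gives 288, 24, 12, and 1 matrices respectively.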

Implications and Future Work

The Linformer's efficiency gains have practical implications for deploying transformer models in resource-constrained environments, making them viable for real-world applications that must handle long text sequences. This is particularly relevant for machine translation, automated summarization, and large language models, where sequence lengths can be extensive.

Future research could further optimize the projection matrices and explore alternative projection methods, such as mean or max pooling or convolution, in place of a simple linear projection. Additionally, integrating Linformer into multi-modal models that combine visual and linguistic data could open new frontiers in efficient AI applications.

Conclusion

This paper makes a significant contribution to improving the efficiency of transformer architectures, presenting a novel approach that effectively reduces the self-attention complexity from $O(n^2)$ to $O(n)$. The theoretical insights and practical performance gains position the Linformer as a robust alternative for deploying transformer models in time-sensitive and resource-limited scenarios. The implications for NLP and broader AI applications are substantial, driving future work on model efficiency and scalability.