Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

Published 7 Feb 2021 in cs.CL and cs.LG | (2102.03902v3)

Abstract: Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences -- a topic being actively studied in the community. To address this limitation, we propose Nyströmformer -- a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with $O(n)$ complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases, even slightly better, than standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nyströmformer performs favorably relative to other efficient self-attention methods. Our code is available at https://github.com/mlpen/Nystromformer.


Authors (7)

Citations (446)


Summary

The paper presents Nyströmformer, a novel algorithm that approximates self-attention with linear complexity using the Nyström method.
The methodology leverages landmark point selection and iterative pseudoinverse computation to drastically lower time and memory requirements.
Experiments show competitive accuracy on language benchmarks (GLUE, IMDB) and favorable results relative to other efficient self-attention variants on the Long Range Arena benchmark.

An Overview of Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

The paper presents Nyströmformer, an efficient alternative to the standard self-attention mechanism in Transformers, designed to address the computational and memory bottlenecks of processing long sequences. Standard self-attention has time and memory costs that grow quadratically with the input sequence length $n$, which limits its scalability. Nyströmformer employs the Nyström method, a well-established technique in numerical linear algebra, to approximate self-attention with linear complexity $O(n)$ in both time and memory when the number of landmark points is held fixed.
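To make the scaling argument concrete, the following minimal sketch counts how many softmax-matrix entries each approach has to materialize. It assumes the Nyström approximation is built from three softmax factors of sizes $n \times m$, $m \times m$, and $m \times n$, where $m$ is the number of landmarks; the function names and the choice $m = 64$ are illustrative assumptions, not values taken from the paper.

```python
def full_attention_entries(n: int) -> int:
    """Entries in the full n x n softmax attention matrix."""
    return n * n


def nystrom_attention_entries(n: int, m: int) -> int:
    """Entries in the three Nystrom factors: (n x m), (m x m), (m x n).

    Assumption: m landmarks are used and m is fixed independently of n.
    """
    return n * m + m * m + m * n


if __name__ == "__main__":
    m = 64  # hypothetical landmark count, chosen only for illustration
    for n in (512, 4096, 32768):
        full = full_attention_entries(n)
        approx = nystrom_attention_entries(n, m)
        print(f"n={n:6d}  full={full:12d}  nystrom={approx:10d}  ratio={full / approx:7.1f}")
```

With $m$ fixed, the Nyström count grows linearly in $n$ while the full count grows quadratically, so the gap widens as sequences get longer.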

Methodology

Nyströmformer's central innovation is applying the Nyström method to approximate the softmax matrix used in self-attention without ever computing the full $n \times n$ matrix. It does so by choosing a small set of $m$ landmark points, with $m \ll n$, from the query (Q) and key (K) matrices before the softmax function is applied; the paper computes these landmarks as segment means. The Nyström method then reconstructs a low-rank approximation of the softmax matrix from the landmarks, which sharply reduces both time and memory requirements compared with the full matrix computation. Because the approximation requires a Moore-Penrose pseudoinverse of an $m \times m$ matrix, the paper also introduces an iterative scheme that computes it using only fast matrix-matrix multiplications.
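The procedure described above can be sketched in NumPy. This is a single-head, unbatched illustration under stated assumptions, not the authors' implementation: landmarks $\tilde{Q}$, $\tilde{K}$ are taken as segment means (which requires $n$ to be divisible by $m$ here), the attention output is approximated as $\mathrm{softmax}(Q\tilde{K}^\top/\sqrt{d})\,\big(\mathrm{softmax}(\tilde{Q}\tilde{K}^\top/\sqrt{d})\big)^{+}\,\mathrm{softmax}(\tilde{Q}K^\top/\sqrt{d})\,V$, and the pseudoinverse uses a third-order iteration $Z \leftarrow \tfrac{1}{4} Z\,(13I - AZ(15I - AZ(7I - AZ)))$ initialized with $A^\top / (\lVert A\rVert_1 \lVert A\rVert_\infty)$. All function names, the default of 6 iterations, and the exact coefficients as written are assumptions to be checked against the paper and its repository.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def segment_means(x, num_landmarks):
    """Average consecutive rows of x (n x d) into num_landmarks rows.

    Assumption: n is divisible by num_landmarks.
    """
    n, d = x.shape
    assert n % num_landmarks == 0, "n must be divisible by num_landmarks"
    return x.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)


def iterative_pinv(a, num_iters=6):
    """Approximate the pseudoinverse of a square matrix using only matmuls."""
    identity = np.eye(a.shape[0])
    norm_1 = np.abs(a).sum(axis=0).max()    # max column sum
    norm_inf = np.abs(a).sum(axis=1).max()  # max row sum
    z = a.T / (norm_1 * norm_inf)
    for _ in range(num_iters):
        az = a @ z
        z = 0.25 * z @ (13 * identity - az @ (15 * identity - az @ (7 * identity - az)))
    return z


def exact_attention(q, k, v):
    """Standard softmax attention; materializes the n x n matrix."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def nystrom_attention(q, k, v, num_landmarks, num_iters=6):
    """Nystrom-style approximation that never forms the n x n matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    q_l = segment_means(q, num_landmarks)  # m x d landmark queries
    k_l = segment_means(k, num_landmarks)  # m x d landmark keys
    f = softmax(q @ k_l.T * scale)         # n x m
    a = softmax(q_l @ k_l.T * scale)       # m x m
    b = softmax(q_l @ k.T * scale)         # m x n
    # Multiply right-to-left so every intermediate is at most n x m or m x d.
    return f @ (iterative_pinv(a, num_iters) @ (b @ v))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 256, 32, 16  # illustrative sizes only
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    out = nystrom_attention(q, k, v, num_landmarks=m)
    print(out.shape)
```

The order of multiplication in the last line of `nystrom_attention` is the design choice that matters for cost: grouping the product as `f @ (z @ (b @ v))` keeps every intermediate linear in $n$, whereas forming `f @ z @ b` first would rebuild an $n \times n$ matrix and lose the benefit.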

Experimental Results

The paper validates Nyströmformer through experiments on language modeling, where it reaches accuracy competitive with baseline models such as BERT. In particular, it performs comparably on masked-language-modeling (MLM) and sentence-order-prediction (SOP) tasks while using only about half of the computational resources. When fine-tuned on downstream NLP tasks from the GLUE benchmark, Nyströmformer scores close to the baseline models, suggesting that the approximation does not significantly compromise accuracy.

Nyströmformer is also evaluated on the Long Range Arena (LRA) benchmark, which tests models on tasks that require long-range context. There it outperforms several efficient self-attention variants, including Reformer, Linformer, and Performer, in average accuracy.

Implications and Future Work

Nyströmformer offers a promising way to scale Transformer models to longer sequences without the quadratic cost of standard self-attention. Further work could examine how different strategies for selecting landmark points affect accuracy and cost, and what trade-offs they involve. Integrating Nyströmformer's attention mechanism into larger Transformer architectures, such as those used for vision or multimodal tasks, is another possible direction. Future research could also investigate how the approach affects the interpretability of attention, since the approximation changes the structure of the attention patterns.

In summary, Nyströmformer advances resource-efficient Transformer models and extends their use to domains where long sequences must be processed. It contributes to the community's ongoing efforts to reduce the computational cost of scaling Transformers.
