On The Computational Complexity of Self-Attention (2209.04881v1)

Published 11 Sep 2022 in cs.LG and cs.CC

Abstract: Transformer architectures have led to remarkable progress in many state-of-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear-time, at the cost of having an exponential dependence on the polynomial order.

Citations (79)

Summary

  • The paper proves that self-attention mechanisms inherently require quadratic time complexity under SETH, challenging efficient scaling.
  • It evaluates approximation strategies such as sparsification and kernel methods, highlighting their lack of rigorous provable guarantees.
  • It demonstrates that dot-product self-attention can be approximated in linear time via Taylor series, albeit with exponential dependence on polynomial degree.

Analyzing the Computational Complexity of Self-Attention

The paper "On the Computational Complexity of Self-Attention" addresses a fundamental question related to the efficiency of transformer architectures, specifically focusing on the self-attention mechanism, which forms a critical component of transformers. Despite its profound successes across diverse applications, including natural language processing, computer vision, and proteomics, the computational cost of self-attention remains quadratic in the sequence length due to pairwise operations on tokens. This quadratic time complexity poses significant challenges, particularly when dealing with long sequences during both training and inference phases.

Core Contributions

The authors pose the central question of whether the computational trade-offs inherent in self-attention can be avoided, that is, whether sub-quadratic algorithms with provable accuracy guarantees exist. Through a theoretical analysis grounded in fine-grained complexity theory, specifically the Strong Exponential Time Hypothesis (SETH), they establish conditional lower bounds indicating that the quadratic barrier cannot be overcome without compromising accuracy.

Key insights from the paper include:

  1. Quadratic Lower Bounds: The authors prove that the time complexity of the self-attention mechanism is inherently quadratic in the input length unless SETH is false. This result holds across different variants of the attention mechanism and even when the computation is only approximate.
  2. Approximation Strategies: While the authors acknowledge efforts to speed up self-attention by utilizing methods such as sparsification, hashing, and kernel approximations, they contend these strategies lack rigorous guarantees for error and accuracy, making the development of provably efficient algorithms challenging.
  3. Sub-Quadratic Kernel Approximations: As a complementary upper bound, the paper demonstrates that dot-product self-attention can be approximated in linear time using a finite Taylor series, albeit with an exponential dependence on the polynomial degree (a sketch of this construction follows the list).
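
To make the Taylor-series construction concrete, here is an illustrative NumPy sketch (not the authors' implementation; the function names are made up for this example). It truncates exp(q · k) at order p; each term (q · k)^m factors through an explicit monomial feature map of width d^m, so key/value statistics can be aggregated once and reused for every query, yielding O(n · d^p) time: linear in the sequence length n but exponential in p.

```python
import math
from itertools import product
import numpy as np

def taylor_features(x, p):
    """Feature map phi of degree <= p with phi(q) . phi(k) = sum_{m=0}^{p} (q.k)^m / m!,
    the order-p Taylor truncation of exp(q . k). Output width is 1 + d + ... + d**p."""
    d = x.shape[-1]
    feats = [np.ones(x.shape[:-1] + (1,))]                      # m = 0 term
    for m in range(1, p + 1):
        monomials = [np.prod(x[..., list(t)], axis=-1)          # x[t1] * ... * x[tm]
                     for t in product(range(d), repeat=m)]      # all d**m index tuples
        feats.append(np.stack(monomials, axis=-1) / math.sqrt(math.factorial(m)))
    return np.concatenate(feats, axis=-1)

def taylor_attention(Q, K, V, p=2):
    """Linear-in-n approximation of softmax attention via truncated Taylor features.
    Note: unlike softmax, the truncated series is not guaranteed positive; toy sketch only."""
    phi_q, phi_k = taylor_features(Q, p), taylor_features(K, p)
    kv = phi_k.T @ V                      # (D, d_v): aggregated once over all keys
    z = phi_k.sum(axis=0)                 # (D,): normalizer, also aggregated once
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Tiny usage example; d is kept small because the feature width grows like d**p.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(512, 8))
approx = taylor_attention(Q, K, V, p=3)
```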

Implications and Future Directions

The findings have significant implications for future transformer development, especially for efforts to optimize self-attention layers. The results highlight a "no free lunch" phenomenon: computation cannot be sped up significantly without some loss of accuracy. This insight prompts researchers to reevaluate assumptions and explore directions such as randomized algorithms or architectural innovations that reduce time complexity while meeting accuracy requirements.

From a theoretical standpoint, the proofs, based on reductions from hard problems such as the Orthogonal Vectors Problem, underscore the robustness of the claims within standard fine-grained complexity frameworks. Although the results address worst-case scenarios, average-case analyses and probabilistic models remain potential pathways toward new solutions.
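
For context, the Orthogonal Vectors Problem asks whether, given two sets of n Boolean vectors in {0,1}^d, some pair has inner product zero; under SETH, no algorithm solves it in strongly sub-quadratic time once d grows faster than log n. The brute-force checker below (an illustrative sketch, not code from the paper) makes the quadratic baseline explicit.

```python
import numpy as np

def has_orthogonal_pair(A, B):
    """Brute-force Orthogonal Vectors check. A, B: (n, d) arrays with 0/1 entries.
    Returns True if some a in A and b in B satisfy <a, b> = 0.
    Takes O(n^2 * d) time; under SETH there is no n^(2 - eps) * poly(d) algorithm
    for d = omega(log n), which is the hardness the paper's reductions transfer to attention."""
    return bool(((A @ B.T) == 0).any())

A = np.array([[1, 0, 1], [0, 1, 1]])
B = np.array([[0, 1, 0], [1, 1, 0]])
print(has_orthogonal_pair(A, B))   # True: (1, 0, 1) . (0, 1, 0) = 0
```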

Conclusion

Overall, "On the Computational Complexity of Self-Attention" provides a critical examination of the foundational limits of self-attention algorithms within transformer architectures. By engaging deeply with complexity theory, this work not only confirms speculations about the inherent computational challenges of self-attention but also sets boundaries for further research, encouraging advancements in efficient algorithm design with provable guarantees.
