Latte: Latent Attention for Linear Time Transformers (2402.17512v4)
Abstract: The time complexity of the standard attention mechanism in transformers scales quadratically with sequence length. We propose a probabilistic framework for attention: by defining a latent variable model, we derive a novel low-rank linear re-parameterisation of both the bidirectional and causal cases. Our method can be seamlessly integrated as a drop-in replacement for standard attention. The framework also provides a natural way to combine local standard attention with our global linear attention, which allows us to extend the context length of existing large pre-trained models with only a few additional training steps. The resulting ``Latte Transformer'' achieves performance comparable to standard attention and other state-of-the-art models, while maintaining linear time and memory complexity and constant-time next-token prediction during inference.
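To illustrate why such a latent-variable factorisation is linear in sequence length, the following is a minimal bidirectional sketch in JAX. It encodes only the factorisation suggested by the abstract, p(j|i) = sum_l p(l|i) p(j|l); the parameter names (W_q, W_k, W_v), the number of latents, and the exact normalisation are assumptions for illustration, not the authors' implementation (which also covers the causal case and multi-head variants).

    import jax.numpy as jnp
    from jax import nn, random

    def latte_bidirectional(x, W_q, W_k, W_v):
        # Hypothetical sketch of latent attention, not the paper's code.
        # p(l | i): each token places a distribution over the L latent states.
        attn_to_latents = nn.softmax(x @ W_q, axis=-1)    # (T, L)
        # p(j | l): each latent state places a distribution over the T tokens.
        attn_from_latents = nn.softmax(x @ W_k, axis=0)   # (T, L)
        values = x @ W_v                                  # (T, D)
        # Per-latent summaries, sum_j p(j|l) v_j: costs O(T * L * D).
        latent_summaries = jnp.einsum('tl,td->ld', attn_from_latents, values)
        # Output, sum_l p(l|i) * summary_l: also O(T * L * D), never O(T^2).
        return attn_to_latents @ latent_summaries         # (T, D)

    # Toy usage with hypothetical sizes: T tokens, model width D, L latents.
    T, D, L = 1024, 64, 16
    keys = random.split(random.PRNGKey(0), 4)
    x = random.normal(keys[0], (T, D))
    W_q = random.normal(keys[1], (D, L))
    W_k = random.normal(keys[2], (D, L))
    W_v = random.normal(keys[3], (D, D))
    y = latte_bidirectional(x, W_q, W_k, W_v)             # shape (T, D)

Both the summary step and the mixing step scale with T * L * D, so for a fixed number of latents time and memory grow linearly in sequence length. In the causal case the per-latent sums would presumably be maintained as running prefix statistics, which is consistent with the constant-time next-token prediction claimed in the abstract.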