Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (2404.07143v2)
Published 10 Apr 2024 in cs.CL, cs.AI, cs.LG, and cs.NE
Abstract: This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval, and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
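The sketch below illustrates the mechanism the abstract describes: causal local attention over the current segment, a compressive memory read via linear attention with an ELU+1 feature map, a learned gate that mixes the two, and an associative outer-product memory update. It is a minimal illustration, not the authors' implementation; the module layout, per-head scalar gate, and helper names are assumptions for clarity.

```python
# Hedged sketch of an Infini-attention-style block (illustrative, not the paper's code).
# Combines causal local softmax attention with a compressive memory that is read by
# linear attention and updated with a key-value outer-product rule, mixed by a gate.
import torch
import torch.nn.functional as F
from torch import nn


class InfiniAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.h, self.dk = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # One learnable gating scalar per head (assumed layout).
        self.beta = nn.Parameter(torch.zeros(n_heads))

    @staticmethod
    def _phi(x):
        # ELU + 1 feature map used for the linear-attention memory read/write.
        return F.elu(x) + 1.0

    def forward(self, x, memory=None, z=None):
        # x: (batch, seg_len, d_model); memory: (batch, h, dk, dk); z: (batch, h, dk)
        b, n, _ = x.shape
        q = self.q_proj(x).view(b, n, self.h, self.dk).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.h, self.dk).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.h, self.dk).transpose(1, 2)

        if memory is None:
            memory = x.new_zeros(b, self.h, self.dk, self.dk)
            z = x.new_zeros(b, self.h, self.dk)

        # 1) Masked (causal) local dot-product attention within the segment.
        local = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # 2) Long-term read from compressive memory via linear attention:
        #    phi(Q) M / (phi(Q) z), with a small clamp to avoid division by zero.
        sq = self._phi(q)
        mem_out = (sq @ memory) / (sq @ z.unsqueeze(-1)).clamp_min(1e-6)

        # 3) Gated combination of memory read and local attention, per head.
        g = torch.sigmoid(self.beta).view(1, self.h, 1, 1)
        out = g * mem_out + (1.0 - g) * local

        # 4) Memory update with the key-value associative (outer-product) rule.
        sk = self._phi(k)
        memory = memory + sk.transpose(-2, -1) @ v
        z = z + sk.sum(dim=2)

        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.o_proj(out), memory, z
```

In use, a long input would be split into fixed-size segments processed in order, with `memory` and `z` carried from one segment to the next; because their shapes depend only on the head dimension, memory and compute per segment stay bounded regardless of total context length, which is the property the abstract highlights.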