SparQ Attention: Bandwidth-Efficient LLM Inference (2312.04985v6)
Abstract: The computational difficulties of LLM inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. Evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks, we show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy.
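To make "selective fetching of the cached history" concrete, below is a minimal, illustrative sketch of the general idea for a single attention head and a single query position: cheaply approximate the attention scores so that only a small subset of the cached key/value rows needs to be transferred in full. The specific heuristic shown (ranking cached positions using only the largest-magnitude query components) is an assumption for illustration; the function name `sparq_like_attention`, the parameters `r` and `k`, and the omission of the paper's further refinements are all simplifications rather than the authors' exact method.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparq_like_attention(q, K, V, r=16, k=64):
    """Bandwidth-reduced attention for one query vector (illustrative sketch).

    q: (d,)    current query
    K: (S, d)  cached keys
    V: (S, d)  cached values
    r: number of query components used to approximate scores
    k: number of cached positions whose K/V rows are fetched in full
    """
    S, d = K.shape
    k = min(k, S)

    # Step 1 (cheap pass): approximate attention scores using only the
    # r largest-magnitude components of q, so just r columns of K are read.
    idx_r = np.argsort(-np.abs(q))[:r]
    approx_scores = K[:, idx_r] @ q[idx_r] / np.sqrt(d)

    # Step 2 (selective fetch): transfer full K/V rows only for the top-k
    # positions under the approximate scores, then attend over that subset.
    top_k = np.argsort(-approx_scores)[:k]
    scores = K[top_k] @ q / np.sqrt(d)
    w = softmax(scores)
    return w @ V[top_k]

# Usage example with a toy cache of 1024 positions and head dimension 128.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((1024, 128))
V = rng.standard_normal((1024, 128))
out = sparq_like_attention(q, K, V, r=16, k=64)  # (128,)
```

The data-transfer saving comes from reading roughly `S*r + 2*k*d` elements instead of the full `2*S*d` key/value cache per query, which is where the reported up-to-8x reduction in attention transfers originates at a high level.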