RelayAttention for Efficient Large Language Model Serving with Long System Prompts (2402.14808v3)
Abstract: A practical LLM service may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across requests. However, the long system prompt causes throughput/latency bottlenecks because the cost of generating the next token grows with the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of the system prompt are transferred from off-chip DRAM to on-chip SRAM multiple times, once for each individual request. To eliminate this redundancy, we propose RelayAttention, an attention algorithm that reads these hidden states from DRAM exactly once for a batch of input tokens. RelayAttention is a free lunch: it maintains generation quality and requires no model retraining, as it is based on a mathematical reformulation of causal attention. By integrating RelayAttention into vLLM, a production-level serving system, we observe significant performance improvements, which become even more pronounced with longer system prompts.
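To make the reformulation concrete, below is a minimal sketch (not the paper's actual kernels) of how causal attention over a shared system prompt plus request-specific context can be split into two partial attentions and fused with their log-sum-exp weights. The tensor shapes, function name, and the decode-step setting (one query per request) are assumptions for illustration; the key point is that the shared system-prompt KV carries no batch dimension, so it only needs to be read from DRAM once per batch.

```python
import torch

def relay_attention_decode(q, k_sys, v_sys, k_req, v_req, scale):
    """Sketch of two-pass attention with a shared system-prompt KV cache.

    Assumed shapes (hypothetical, for illustration only):
      q:            (batch, heads, 1, d)        one decode-step query per request
      k_sys, v_sys: (heads, sys_len, d)         shared system-prompt KV (no batch dim)
      k_req, v_req: (batch, heads, req_len, d)  request-specific KV
    """
    # Pass 1: every query in the batch attends to the SAME system-prompt KV.
    # Because k_sys / v_sys have no batch dimension, a fused kernel can stream
    # them from DRAM once for the whole batch instead of once per request.
    sys_scores = torch.einsum("bhqd,hkd->bhqk", q, k_sys) * scale
    sys_lse = torch.logsumexp(sys_scores, dim=-1, keepdim=True)   # (b, h, 1, 1)
    sys_out = torch.softmax(sys_scores, dim=-1) @ v_sys           # (b, h, 1, d)

    # Pass 2: ordinary per-request attention over the request-specific KV.
    req_scores = torch.einsum("bhqd,bhkd->bhqk", q, k_req) * scale
    req_lse = torch.logsumexp(req_scores, dim=-1, keepdim=True)
    req_out = torch.softmax(req_scores, dim=-1) @ v_req

    # Fuse the two partial outputs by their softmax normalizers; this equals
    # softmax over the concatenated [system; request] context exactly.
    w_sys = torch.sigmoid(sys_lse - req_lse)  # exp(sys_lse) / (exp(sys_lse) + exp(req_lse))
    return w_sys * sys_out + (1.0 - w_sys) * req_out
```

The fusion step is exact rather than approximate: weighting each partial output by its share of the total softmax denominator recovers the same result as attending over the full concatenated context, which is why no retraining is needed.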