Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation (2405.01814v2)
Abstract: Transformer-based LLMs exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split the different phases of LLM inference, the efficiency of the decoding phase remains low. This is caused by the varying resource demands of different operators in transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long-context requests. To enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for the other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Moreover, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our approach, we develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster. Experimental results indicate that Lamina can provide 16.1% to 90.1% higher estimated throughput than existing solutions with similar costs.
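The core idea described in the abstract, running the memory-bound attention operator on memory-optimized devices while keeping the compute-bound projections and MLP on high-end accelerators, can be illustrated with a minimal PyTorch sketch. The sketch below is not the paper's Lamina implementation: the device placement uses the CPU as a stand-in for a memory-optimized device, and the helper names (`AttentionWorker`, `decode_layer`) are invented for this example.

```python
# Minimal sketch of model-attention disaggregation for single-token decoding.
# Assumption: "cuda" plays the compute-optimized accelerator and the CPU
# stands in for a cheap, memory-optimized device that owns the KV cache.
import torch

COMPUTE_DEV = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MEMORY_DEV = torch.device("cpu")   # placeholder for a memory-optimized device


class AttentionWorker:
    """Holds the per-request KV cache and runs the memory-bound attention op."""

    def __init__(self, num_heads: int, head_dim: int):
        self.num_heads, self.head_dim = num_heads, head_dim
        self.k_cache = torch.empty(0, num_heads, head_dim, device=MEMORY_DEV)
        self.v_cache = torch.empty(0, num_heads, head_dim, device=MEMORY_DEV)

    def attend(self, q, k, v):
        # q/k/v for the new token arrive over the interconnect; append the new
        # K/V to the cache, then attend over the full context on this device.
        q, k, v = q.to(MEMORY_DEV), k.to(MEMORY_DEV), v.to(MEMORY_DEV)
        self.k_cache = torch.cat([self.k_cache, k], dim=0)   # [seq, heads, dim]
        self.v_cache = torch.cat([self.v_cache, v], dim=0)
        scores = torch.einsum("hd,shd->hs", q[0], self.k_cache) / self.head_dim ** 0.5
        probs = torch.softmax(scores, dim=-1)
        out = torch.einsum("hs,shd->hd", probs, self.v_cache)
        return out.unsqueeze(0).to(COMPUTE_DEV)               # ship result back


def decode_layer(x, wq, wk, wv, wo, mlp, worker):
    """One decoder layer for one new token; x is [1, hidden] on COMPUTE_DEV."""
    heads, dim = worker.num_heads, worker.head_dim
    q = (x @ wq).view(1, heads, dim)      # compute-bound projections stay local
    k = (x @ wk).view(1, heads, dim)
    v = (x @ wv).view(1, heads, dim)
    attn = worker.attend(q, k, v).reshape(1, -1)  # memory-bound attention is remote
    return x + mlp(attn @ wo)             # residual and MLP stay on the accelerator


# Toy usage: decode a few tokens for one request through a single layer.
hidden, heads, dim = 256, 8, 32
worker = AttentionWorker(heads, dim)
wq, wk, wv, wo = (torch.randn(hidden, hidden, device=COMPUTE_DEV) / hidden ** 0.5
                  for _ in range(4))
mlp = torch.nn.Sequential(torch.nn.Linear(hidden, 4 * hidden),
                          torch.nn.GELU(),
                          torch.nn.Linear(4 * hidden, hidden)).to(COMPUTE_DEV)
x = torch.randn(1, hidden, device=COMPUTE_DEV)
for _ in range(4):
    x = decode_layer(x, wq, wk, wv, wo, mlp, worker)
```

In this sketch only the per-token query/key/value vectors and the attention output cross the device boundary, while the bulky KV cache never leaves the memory-optimized side; this is consistent with the abstract's claim that the required inter-device bandwidth is manageable with prevalent networking technologies.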