Efficient and Economic Large Language Model Inference with Attention Offloading (2405.01814v1)
Abstract: Transformer-based LLMs exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.
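The core idea of attention offloading can be illustrated with a minimal, single-head PyTorch sketch (this is an illustration of the concept, not Lamina's actual implementation): the KV cache and the memory-bound attention computation are placed on a cheap, memory-optimized device, while the compute-bound projections stay on the high-end accelerator, so only small per-token activations cross the interconnect. The device choices (`cpu` standing in for the memory-optimized device), shapes, and variable names below are assumptions made for the example.

```python
# Minimal sketch of attention offloading, assuming PyTorch, a single head,
# and one decode token per step. "cpu" stands in for the cheap,
# memory-optimized device that owns the KV cache.
import torch

COMPUTE_DEV = "cuda" if torch.cuda.is_available() else "cpu"  # high-end accelerator
MEMORY_DEV = "cpu"                                            # memory-optimized device

d_model, n_ctx = 4096, 2048

# Model weights (QKV / output projections) stay on the compute device.
w_qkv = torch.randn(d_model, 3 * d_model, device=COMPUTE_DEV)
w_out = torch.randn(d_model, d_model, device=COMPUTE_DEV)

# The large KV cache lives on the memory device, where capacity is cheap.
k_cache = torch.randn(n_ctx, d_model, device=MEMORY_DEV)
v_cache = torch.randn(n_ctx, d_model, device=MEMORY_DEV)


def decode_step(x: torch.Tensor) -> torch.Tensor:
    """One generation step for a single token `x` of shape (1, d_model)."""
    # 1. Compute-bound projections run on the accelerator.
    q, k, v = (x @ w_qkv).split(d_model, dim=-1)

    # 2. Only the small q/k/v activations cross the link;
    #    the large KV cache never moves.
    q_m, k_m, v_m = q.to(MEMORY_DEV), k.to(MEMORY_DEV), v.to(MEMORY_DEV)
    k_all = torch.cat([k_cache, k_m], dim=0)
    v_all = torch.cat([v_cache, v_m], dim=0)

    # 3. Memory-bound attention over the cached context runs on the
    #    memory-optimized device.
    scores = torch.softmax(q_m @ k_all.T / d_model**0.5, dim=-1)
    attn = scores @ v_all

    # 4. The small attention output returns to the accelerator for the
    #    compute-bound output projection (and, in a full layer, the MLP).
    return attn.to(COMPUTE_DEV) @ w_out


out = decode_step(torch.randn(1, d_model, device=COMPUTE_DEV))
print(out.shape)  # torch.Size([1, 4096])
```

Note that the per-step traffic between the two devices is only a few activation vectors per token, which is why, as the abstract argues, the required bandwidth stays manageable with prevalent networking technologies even as the cached context grows.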