Splitwise: Efficient generative LLM inference using phase splitting (2311.18677v2)
Abstract: Recent innovations in generative LLMs have made their applications and use cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike the compute-intensive prompt computation phase, the token generation phase does not require the compute capability of the latest GPUs and can be run at lower power and cost. With Splitwise, we propose splitting the two phases of an LLM inference request onto separate machines. This allows us to use hardware that is well suited to each phase and to provision resources independently per phase. However, splitting an inference request across machines requires transferring state from the machine running prompt computation to the machine generating tokens. We implement and optimize this state transfer using the fast back-plane interconnects available in today's GPU clusters. We use the Splitwise technique to design LLM inference clusters that use the same or different types of machines for the prompt computation and token generation phases. Our clusters are optimized for three key objectives: throughput, cost, and power. In particular, we show that we can achieve 1.4x higher throughput at 20% lower cost than current designs. Alternatively, we can achieve 2.35x more throughput under the same cost and power budgets.
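To make the phase-splitting idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of routing the two inference phases to separate machine pools. The names `PromptMachine`, `TokenMachine`, and `transfer_kv_cache` are hypothetical, and the model math is replaced by placeholders; in a real deployment the prefill would run on a high-end GPU pool, the KV cache would be shipped over the cluster's back-plane interconnect, and decoding would continue on cheaper or lower-power machines.

```python
# Illustrative sketch of Splitwise-style phase splitting (hypothetical names,
# toy stand-ins for the actual model forward passes).
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value cache produced by the prompt (prefill) phase."""
    request_id: int
    layers: list = field(default_factory=list)  # one entry per transformer layer


class PromptMachine:
    """Compute-intensive prefill: processes the whole prompt in one pass."""

    def run_prefill(self, request_id: int, prompt_tokens: list[int]) -> tuple[int, KVCache]:
        # Stand-in for a real forward pass over all prompt tokens.
        cache = KVCache(request_id=request_id, layers=[list(prompt_tokens)])
        first_token = prompt_tokens[-1] + 1  # placeholder "sampled" first token
        return first_token, cache


class TokenMachine:
    """Memory-bound decode: generates one token per step from the KV cache."""

    def run_decode(self, cache: KVCache, first_token: int, max_new_tokens: int) -> list[int]:
        generated = [first_token]
        for _ in range(max_new_tokens - 1):
            # Stand-in for a single-token forward pass that appends to the cache.
            cache.layers[0].append(generated[-1])
            generated.append(generated[-1] + 1)
        return generated


def transfer_kv_cache(cache: KVCache) -> KVCache:
    # In practice this is an RDMA/NCCL-style transfer over the GPU cluster's
    # back-plane interconnect, ideally overlapped with the prefill itself.
    return cache


if __name__ == "__main__":
    prompt_pool, decode_pool = PromptMachine(), TokenMachine()
    first_token, kv = prompt_pool.run_prefill(request_id=0, prompt_tokens=[10, 11, 12])
    kv = transfer_kv_cache(kv)  # state handoff between the two machine pools
    print(decode_pool.run_decode(kv, first_token, max_new_tokens=4))
```

The point of the sketch is the separation of concerns: because the prefill and decode pools are distinct, each can be sized and provisioned for its own latency, power, and cost targets, with the KV-cache handoff as the only coupling between them.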