Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (2403.02310v3)
Abstract: Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token; the second is decode, which generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute because the input prompt is processed in parallel. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and, consequently, for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills, which split a prefill request into near-equal-sized chunks, and creates stall-free schedules that add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations, resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. Compared to vLLM, we achieve 2.6x higher serving capacity for Mistral-7B on a single A100 GPU and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
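To make the scheduling idea in the abstract concrete, below is a minimal Python sketch of chunked-prefill, stall-free batching under a per-iteration token budget. It is an illustration only, not the Sarathi-Serve implementation; the names (`Request`, `schedule_iteration`, `token_budget`) and the default budget of 512 tokens are hypothetical.

```python
# Hypothetical sketch of chunked-prefill, stall-free batching.
# Not the sarathi-serve API; class/field names are illustrative.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_len: int          # total prompt tokens to prefill
    prefilled: int = 0       # prompt tokens processed so far
    decoding: bool = False   # True once prefill has finished

def schedule_iteration(running, waiting, token_budget=512):
    """Build one hybrid batch: every ongoing decode gets one token, and the
    leftover budget is spent on prefill chunks, so long prompts never pause
    ongoing decodes (stall-free scheduling)."""
    batch = []
    # 1. Decodes first: each ongoing decode contributes exactly one token.
    for req in running:
        if req.decoding:
            batch.append((req, 1))
            token_budget -= 1
    # 2. Fill the remaining budget with prefill chunks: partially prefilled
    #    requests first, then newly admitted ones from the waiting queue.
    for req in list(running) + list(waiting):
        if token_budget <= 0:
            break
        if not req.decoding and req.prefilled < req.prompt_len:
            chunk = min(token_budget, req.prompt_len - req.prefilled)
            batch.append((req, chunk))
            token_budget -= chunk
            if req in waiting:
                waiting.remove(req)
                running.append(req)
    return batch

# Usage: a 4096-token prompt is admitted as a 512-token chunk per iteration
# while any ongoing decodes keep producing one token every iteration.
waiting = deque([Request(0, prompt_len=4096), Request(1, prompt_len=1024)])
running = []
first_batch = schedule_iteration(running, waiting)
```

Because every iteration mixes at most `token_budget` prefill tokens with the ongoing decodes, iteration times stay roughly uniform, which is what keeps decode latency bounded and pipeline bubbles small.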
- Amazon CodeWhisperer. https://aws.amazon.com/codewhisperer/.
- Anthropic Claude. https://claude.ai.
- arXiv.org e-Print archive. https://arxiv.org/.
- Bing AI. https://www.bing.com/chat.
- Character AI. https://character.ai.
- ChatGPT. https://chat.openai.com.
- FasterTransformer. https://github.com/NVIDIA/FasterTransformer.
- GitHub Copilot. https://github.com/features/copilot.
- Google Bard. https://bard.google.com.
- Google Duet AI. https://workspace.google.com/solutions/ai/.
- Komo. https://komo.ai/.
- LightLLM: A light and fast inference service for LLM. https://github.com/ModelTC/lightllm.
- Matrix Multiplication Background User's Guide. https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html.
- Microsoft Copilot. https://www.microsoft.com/en-us/microsoft-copilot.
- NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl.
- NVIDIA Triton Dynamic Batching. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#dynamic-batcher.
- OpenAI GPT-3: Understanding the architecture. https://www.theaidream.com/post/openai-gpt-3-understanding-the-architecture.
- Perplexity AI. https://www.perplexity.ai/.
- Replit Ghostwriter. https://replit.com/site/ghostwriter.
- TensorRT-LLM: A TensorRT toolbox for optimized large language model inference. https://github.com/NVIDIA/TensorRT-LLM.
- vLLM. https://github.com/vllm-project/vllm.
- xFormers optimized operators. https://facebookresearch.github.io/xformers/components/ops.html.
- Yi series of large language models trained from scratch by developers at 01.AI. https://huggingface.co/01-ai/Yi-34B-200K.
- You.com. https://you.com/.
- APIServe: Efficient API support for large-language model inferencing. arXiv preprint arXiv:2402.01869, 2024.
- Abien Fred Agarap. Deep learning using rectified linear units (ReLU), 2019.
- Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills, 2023.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
- The falcon series of open language models, 2023.
- Efficient large scale language modeling with mixtures of experts, 2022.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- PaLM: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022.
- A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
- Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 613–627, 2017.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022.
- QLoRA: Efficient finetuning of quantized LLMs, 2023.
- TurboTransformers: An efficient GPU serving system for transformer models. In PPoPP ’21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, February 27 - March 3, 2021, pages 389–402. ACM, 2021.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023.
- Low latency rnn inference with cellular batching. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, New York, NY, USA, 2018. Association for Computing Machinery.
- Serving DNNs like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 443–462, 2020.
- Gaussian error linear units (GELUs), 2023.
- DeepSpeed-FastGen: High-throughput text generation for LLMs via MII and DeepSpeed-Inference, 2024.
- Inference without interference: Disaggregate LLM inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.
- Towards MoE deployment: Mitigating inefficiencies in mixture-of-expert (MoE) inference, 2023.
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22, page 402–416, New York, NY, USA, 2022. Association for Computing Machinery.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Scaling laws for neural language models. CoRR, abs/2001.08361, 2020.
- Efficient memory management for large language model serving with PagedAttention. In SOSP ’23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
- Accelerating distributed MoE training and inference with Lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, Boston, MA, July 2023. USENIX Association.
- TensorFlow-Serving: Flexible, high-performance ML serving, 2017.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- Splitwise: Efficient generative LLM inference using phase splitting, 2023.
- Efficiently scaling transformer inference, 2022.
- Self-attention does not need O(n^2) memory, 2022.
- Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019.
- Fairness in serving large language models. arXiv preprint arXiv:2401.00588, 2023.
- FlexGen: High-throughput generative inference of large language models with a single GPU, 2023.
- Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- Retentive network: A successor to transformer for large language models, 2023.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- OpenChat: Advancing open-source language models with mixed-quality data, 2023.
- Overlap communication with dependent computation via decomposition in large deep learning models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ASPLOS 2023, page 93–106, New York, NY, USA, 2022. Association for Computing Machinery.
- LightSeq: A high performance inference library for transformers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers (NAACL-HLT), pages 113–120. Association for Computational Linguistics, June 2021.
- Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022, 2022.
- Fast distributed inference serving for large language models, 2023.
- SmoothQuant: Accurate and efficient post-training quantization for large language models, 2023.
- Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association.
- LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset, 2023.
- DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving, 2024.