Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (2401.11181v1)

Published 20 Jan 2024 in cs.DC

Abstract: Transformer-based LLM inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in TetriInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively.


Summary

  • The paper presents TetriInfer, showing that disaggregating LLM inference into prefill and decode tasks can reduce average time-to-first-token (TTFT) by 97% and average job completion time (JCT) by 47%.
  • It employs fixed-size prefill chunking and predictive scheduling using a smaller LLM classifier to efficiently manage mixed workload requests.
  • Empirical results demonstrate a 38% reduction in resource usage, enabling scalable and cost-effective LLM inference in cloud environments.

"Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads"

Introduction

The paper "Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads" (2401.11181) presents an innovative approach to managing inference serving in cloud environments for LLMs using a system named TetriInfer. Inference serving for LLMs, ubiquitous in modern cloud services, consists of two distinct phases: the prefill and decode phases. Existing deployment paradigms often conflate these distinct phases leading to considerable resource interference and inefficiencies. The authors address the challenge of minimizing such interference through the strategic disaggregation and scheduling of inference tasks.

Motivation

LLM inference requests vary widely in their prefill (prompt) and decode (generated) token lengths, and traditional approaches that mix them on the same instance suffer severe contention and inefficiency. The paper's analysis shows that this interference can degrade performance substantially, causing up to a 10x slowdown for prefill requests and a 16% throughput loss for decode requests. The authors therefore propose disaggregating prefill from decode so that each phase is processed independently and efficiently, avoiding the inefficiencies of co-executing tasks with heterogeneous characteristics.

Methodology

TetriInfer rests on three pillars: fixed-size prefill chunking, disaggregated prefill and decode instances, and predictive scheduling. Prompt inputs are segmented into fixed-size chunks so that the accelerator stays close to its computation-saturated limit without incurring interference penalties. Prefill and decode run on separate instances, and a two-level scheduler uses predicted resource usage to assign requests to decode instances and avoid hotspots. Figure 1 shows the wide spread of prompt and generated-token lengths that motivates this design; a minimal chunking sketch follows the figure caption below.

Figure 1: Length distribution of prompt tokens (prefill) and generated tokens (decode).
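
To make the chunking idea concrete, below is a minimal sketch, not the authors' implementation, of how variable-length prompts might be split into fixed-size chunks and packed into prefill batches that each carry a roughly constant amount of work. The 512-token chunk size and all function names are illustrative assumptions.

```python
from typing import List

CHUNK_SIZE = 512  # illustrative; chosen near the accelerator's compute-saturation point


def chunk_prompt(prompt_tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split one prompt's token IDs into fixed-size chunks (the last chunk may be shorter)."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]


def pack_prefill_batches(prompts: List[List[int]],
                         chunk_size: int = CHUNK_SIZE) -> List[List[List[int]]]:
    """Greedily pack chunks, possibly from different prompts, so that each prefill
    batch carries at most `chunk_size` tokens of work, keeping the prefill
    instance close to compute saturation."""
    batches, current, current_tokens = [], [], 0
    for prompt in prompts:
        for chunk in chunk_prompt(prompt, chunk_size):
            if current_tokens + len(chunk) > chunk_size and current:
                batches.append(current)
                current, current_tokens = [], 0
            current.append(chunk)
            current_tokens += len(chunk)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    prompts = [list(range(1300)), list(range(200)), list(range(700))]
    for i, batch in enumerate(pack_prefill_batches(prompts)):
        print(f"batch {i}: {[len(c) for c in batch]} tokens per chunk")
```

Packing chunks from different prompts into one batch is what keeps every prefill step near the compute-saturated point regardless of how long any individual prompt happens to be.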

Concretely, prefill is executed in fixed-size chunked batches, while decode requests are placed by a scheduling algorithm that predicts how many tokens each request will generate and plans capacity accordingly. The predictor is a smaller LLM used as a classifier, and its length estimates enable more efficient scheduling. Because prefill and decode run on different instances, each side can be scaled and managed independently, further mitigating interference; a simplified sketch of this length-aware decode scheduling appears below.
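
The following is a hedged sketch of the length-aware decode placement idea: a stand-in classifier predicts a coarse length bucket for each request, and the scheduler routes the request to the decode instance with the least predicted outstanding work. The bucket boundaries, the fake classifier, and the instance bookkeeping are assumptions for illustration; the paper's actual predictor is a small LLM classifier and its two-level scheduler is more elaborate.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative length buckets (in generated tokens); a coarse range is predicted
# rather than an exact length.
LENGTH_BUCKETS = [(0, 128), (128, 512), (512, 2048)]


def predict_bucket(prompt: str) -> int:
    """Stand-in for the small LLM classifier: fakes a bucket from prompt length.
    A real system would run the classifier model here."""
    return min(len(prompt) // 200, len(LENGTH_BUCKETS) - 1)


@dataclass(order=True)
class DecodeInstance:
    predicted_load: int          # sum of predicted tokens across requests placed here
    name: str = field(compare=False)


def schedule(prompts: List[str], instances: List[DecodeInstance]) -> List[Tuple[str, str]]:
    """Route each request to the decode instance with the least predicted
    outstanding work, using the bucket midpoint as its load estimate."""
    heap = instances[:]
    heapq.heapify(heap)
    placements = []
    for prompt in prompts:
        lo, hi = LENGTH_BUCKETS[predict_bucket(prompt)]
        expected_tokens = (lo + hi) // 2
        target = heapq.heappop(heap)          # least-loaded instance
        placements.append((prompt[:20], target.name))
        target.predicted_load += expected_tokens
        heapq.heappush(heap, target)
    return placements


if __name__ == "__main__":
    instances = [DecodeInstance(0, f"decode-{i}") for i in range(3)]
    prompts = ["short question", "a" * 300, "b" * 900, "c" * 50]
    for head, inst in schedule(prompts, instances):
        print(f"{head!r} -> {inst}")
```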

Results

Empirical results show that TetriInfer markedly improves key performance metrics such as time-to-first-token (TTFT) and job completion time (JCT). Specifically, the system uses 38% fewer resources while reducing average TTFT and average JCT by 97% and 47%, respectively, indicating significant gains in computational efficiency. Its ability to handle inference requests of highly variable lengths makes it well suited to real-world, high-variance workloads and yields a marked improvement in performance per dollar over existing methods; a rough reading of these numbers in performance-per-dollar terms is sketched after Figure 2.

Figure 2: TetriInfer's Workflow and Architecture.
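
As a back-of-the-envelope illustration only (not a figure reported by the paper), the abstract's numbers can be read in performance-per-dollar terms under two simplifying assumptions: cost scales linearly with resources, and throughput scales inversely with average JCT.

```python
# Illustrative calculation from the abstract's headline numbers, under the
# assumptions stated above; not a result claimed by the paper.
baseline_resources, tetri_resources = 1.00, 0.62   # "38% fewer resources"
baseline_jct, tetri_jct = 1.00, 0.53               # "47% lower average JCT"

rel_perf_per_dollar = (baseline_jct / tetri_jct) / (tetri_resources / baseline_resources)
print(f"~{rel_perf_per_dollar:.1f}x performance per dollar under these assumptions")
# ~3.0x
```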

Implications and Future Work

TetriInfer demonstrates significant progress in handling LLM workload variability and interference. The proposed decoupling model is scalable and adaptable, providing a framework that future research can extend to more complex inference environments and possibly integrate with other emerging LLM optimizations such as model partitioning and distributed inference.

Moreover, the implementation paves the way for finer-grained predictive modeling of the decode phase, which could further improve scheduling decisions and reduce latency. Exploring alternative network-stack optimizations for moving state between prefill and decode instances, as the authors indicate, could also improve efficiency in distributed settings.

Conclusion

"Inference without Interference" charts a path toward more effective and efficient handling of LLM inference in cloud environments by strategically separating and scheduling prefill and decode tasks. TetriInfer represents a substantial step toward maximizing the utility of computational resources as the complexity of and demand for LLM services grow. The work offers a promising avenue for scaling AI applications at sustainable resource cost and sets a benchmark for future efforts in the area.
