Vidur: A Large-Scale Simulation Framework For LLM Inference (2405.05465v2)
Abstract: Optimizing the deployment of LLMs is expensive today because it requires experimentally running an application workload against an LLM implementation while exploring the large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur, a large-scale, high-fidelity, easily extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates end-to-end inference performance for different workloads by estimating several metrics of interest, such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, whereas a deployment-based exploration would require 42K GPU hours, costing roughly 218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.
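To make the simulator-driven search described above concrete, below is a minimal sketch of the kind of configuration sweep Vidur-Search performs: enumerate candidate deployment configurations, ask a simulator for an estimated latency under the target workload, and keep the cheapest configuration that meets the latency constraint. All names here (SimpleConfig, simulate_p99_latency, GPU_HOURLY_COST) and the toy latency model are illustrative assumptions, not the actual Vidur API or cost figures.

```python
# Hypothetical sketch of a simulator-driven configuration search, in the spirit
# of Vidur-Search. Names and numbers are assumptions for illustration only.
from dataclasses import dataclass
from itertools import product

GPU_HOURLY_COST = {"A100": 4.1, "H100": 6.5}  # assumed $/GPU-hour figures


@dataclass(frozen=True)
class SimpleConfig:
    gpu: str
    tensor_parallel: int
    pipeline_parallel: int
    max_batch_size: int


def simulate_p99_latency(cfg: SimpleConfig, qps: float) -> float:
    """Stand-in for a simulator call: return an estimated P99 latency in seconds.

    A real search would replace this with Vidur's workload simulation; here we
    use a toy queueing-style model purely to make the loop below runnable.
    """
    gpus = cfg.tensor_parallel * cfg.pipeline_parallel
    service_rate = 2.0 * gpus * min(cfg.max_batch_size, 64) / 8.0
    if qps >= service_rate:  # overloaded: latency grows without bound
        return float("inf")
    return 0.5 + 3.0 / (service_rate - qps)


def search(qps: float, latency_slo_s: float):
    """Return the cheapest (config, $/hour) pair that meets the latency SLO."""
    best = None
    for gpu, tp, pp, bs in product(GPU_HOURLY_COST, [1, 2, 4, 8], [1, 2, 4], [32, 64, 128]):
        cfg = SimpleConfig(gpu, tp, pp, bs)
        if simulate_p99_latency(cfg, qps) > latency_slo_s:
            continue  # violates the application performance constraint
        cost_per_hour = GPU_HOURLY_COST[gpu] * tp * pp
        if best is None or cost_per_hour < best[1]:
            best = (cfg, cost_per_hour)
    return best


if __name__ == "__main__":
    print(search(qps=10.0, latency_slo_s=2.0))
```

Because every candidate is evaluated by the simulator rather than by a live deployment, a sweep like this runs on a CPU machine in minutes to hours instead of consuming thousands of GPU hours.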