LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization (2403.01136v1)
Abstract: Recent breakthroughs in large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, but the immense size of these models leads to very high resource demands and costs for serving them. Although LLMs are mostly served today on homogeneous clusters of high-end GPUs, using a heterogeneous cluster that mixes available high- and low-capacity GPUs can substantially reduce serving cost. Existing solutions, however, focus on model partitioning and uniform compression across homogeneous devices; efficient LLM serving on heterogeneous clusters remains largely unsupported. This paper proposes LLM-PQ, a system that combines adaptive mixed-precision model quantization with phase-aware model partitioning to improve LLM serving efficiency on heterogeneous GPU clusters. An efficient algorithm jointly decides the mixed-precision quantization, phase-aware model partition, and micro-batch sizes for distributed LLM serving, substantially increasing inference throughput while meeting user-specified model quality targets. Extensive experiments on production inference workloads across 11 different clusters show that LLM-PQ improves inference throughput by up to 2.88x (2.26x on average) over state-of-the-art systems.
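To make the joint decision the abstract describes more concrete, below is a minimal sketch in Python of this kind of search: it enumerates contiguous layer partitions over a small heterogeneous pipeline and, for each stage, picks a weight precision that fits the assigned GPU's memory, minimizing the bottleneck stage latency subject to a quality budget. All numbers (per-layer memory and latency, GPU capacities, quality penalties), the brute-force enumeration, and the helper names are illustrative assumptions, not LLM-PQ's actual cost model or optimizer; the sketch also omits the phase-aware (prefill vs. decode) costs and micro-batch sizing that the real system additionally optimizes.

```python
# Toy sketch (NOT LLM-PQ's algorithm): jointly choose a contiguous layer partition
# and per-stage bit-widths on a heterogeneous 3-GPU pipeline, minimizing the
# bottleneck stage latency under memory and quality constraints.
from itertools import combinations

# Hypothetical per-layer costs at each weight precision (illustrative values only).
LAYER_MEM = {16: 1.2, 8: 0.6, 4: 0.3}         # GB of weights per transformer layer
LAYER_LAT = {16: 1.0, 8: 1.1, 4: 1.3}         # ms per layer (dequant overhead assumed)
QUALITY_PENALTY = {16: 0.0, 8: 0.05, 4: 0.3}  # proxy for per-layer quality loss

NUM_LAYERS = 24
GPU_MEM = [24, 8, 8]      # assumed memory budgets: one large GPU, two small GPUs (GB)
QUALITY_BUDGET = 2.0      # assumed user-specified cap on total quality penalty

def stage_plan(num_layers, mem_cap):
    """Highest precision whose weights fit on this GPU; returns (bits, latency) or None."""
    for bits in (16, 8, 4):
        if num_layers * LAYER_MEM[bits] <= mem_cap:
            return bits, num_layers * LAYER_LAT[bits]
    return None  # does not fit even at 4-bit

best = None
# Two cut points among layer boundaries -> three contiguous pipeline stages.
for cuts in combinations(range(1, NUM_LAYERS), len(GPU_MEM) - 1):
    bounds = (0, *cuts, NUM_LAYERS)
    sizes = [bounds[i + 1] - bounds[i] for i in range(len(GPU_MEM))]
    plans = [stage_plan(n, cap) for n, cap in zip(sizes, GPU_MEM)]
    if any(p is None for p in plans):
        continue  # some stage cannot be placed on its GPU
    penalty = sum(n * QUALITY_PENALTY[bits] for n, (bits, _) in zip(sizes, plans))
    if penalty > QUALITY_BUDGET:
        continue  # violates the assumed model-quality target
    bottleneck = max(lat for _, lat in plans)  # pipeline throughput ~ 1 / slowest stage
    if best is None or bottleneck < best[0]:
        best = (bottleneck, sizes, [bits for bits, _ in plans])

if best:
    print(f"layers per stage: {best[1]}, bit-widths: {best[2]}, "
          f"bottleneck latency: {best[0]:.1f} ms")
```

With these assumed numbers, the search lands on a balanced partition where the small GPUs hold 8-bit layers while the large GPU keeps 16-bit layers, illustrating how partition and precision choices interact; the real system replaces the toy cost tables with profiled memory/latency models and a quality indicator, and solves the joint problem with an efficient algorithm rather than enumeration.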