
Minions: Accelerating Large Language Model Inference with Aggregated Speculative Execution (2402.15678v2)

Published 24 Feb 2024 in cs.DC

Abstract: Large language models (LLMs) have recently attracted surging interest due to their outstanding capabilities across various domains. However, efficient LLM inference is challenging because autoregressive decoding generates tokens only one at a time. Although prior work applies pruning or quantization to speed up LLM inference, these techniques typically require fine-tuning the LLM, incurring significant time and economic costs. Meanwhile, speculative decoding has been proposed to accelerate LLM inference with small speculative models (SSMs). However, the low acceptance rate of the SSM and the high verification cost of the LLM limit further performance improvement. In this paper, we propose Minions, an LLM inference system that accelerates LLM inference with collective and adaptive speculative generation. Specifically, Minions introduces a majority-voting mechanism that leverages multiple SSMs to jointly speculate the LLM's output, improving inference performance without introducing prohibitive computation costs for the LLM. To better balance the number of tokens speculated by the SSMs against the LLM's verification cost, Minions employs an adaptive mechanism that dynamically determines the optimal speculation length, achieving better inference performance across different models, datasets, and hyperparameters. In addition, Minions efficiently decouples SSM decoding from LLM verification and adopts a pipelined execution mechanism to further improve inference performance. Compared with state-of-the-art LLM inference systems, Minions achieves higher inference throughput and lower inference latency.
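
The abstract names three mechanisms: majority-voted speculation across multiple SSMs, an adaptive speculation length, and pipelined execution. The sketch below illustrates the first two under stated assumptions; it is not the paper's implementation, and every name in it (Drafter, Verifier, majority_vote, the k-adjustment rule) is a hypothetical stand-in. SSMs and the LLM are modeled as plain callables so the control flow stays visible.

```python
# A minimal sketch of majority-voted speculative decoding with an adaptive
# speculation length, assuming greedy (argmax) decoding throughout.
# All names here are illustrative stand-ins, not the paper's actual API.

from collections import Counter
from typing import Callable, List, Sequence

Token = int
# An SSM is modeled as a callable that drafts k tokens after a prefix
# (assumed to always return at least one token).
Drafter = Callable[[Sequence[Token], int], List[Token]]
# The LLM verifier returns its own next token at each draft position;
# in a real system this is one batched forward pass over prefix + draft.
Verifier = Callable[[Sequence[Token], Sequence[Token]], List[Token]]


def majority_vote(drafts: List[List[Token]]) -> List[Token]:
    """Merge per-SSM drafts position by position, keeping the token most
    SSMs agree on; stop once the winner has no real consensus behind it."""
    merged: List[Token] = []
    for column in zip(*drafts):  # zip truncates to the shortest draft
        token, votes = Counter(column).most_common(1)[0]
        if votes < 2:
            break
        merged.append(token)
    return merged


def speculative_step(prefix: List[Token], ssms: List[Drafter],
                     verify: Verifier, k: int) -> List[Token]:
    """Draft k tokens with every SSM, vote, then let the LLM verify the
    voted draft and accept the longest prefix matching its own output."""
    drafts = [ssm(prefix, k) for ssm in ssms]
    draft = majority_vote(drafts) or drafts[0][:1]  # fall back to one token
    reference = verify(prefix, draft)
    accepted: List[Token] = []
    for drafted, actual in zip(draft, reference):
        accepted.append(actual)  # the LLM's token is always safe to emit
        if drafted != actual:    # first mismatch ends this round
            break
    return accepted


def decode(prefix: List[Token], ssms: List[Drafter], verify: Verifier,
           max_new: int = 64, k: int = 4) -> List[Token]:
    """Adaptive loop: a simple acceptance-rate heuristic (an assumption,
    not the paper's policy) grows k after well-accepted rounds and shrinks
    it otherwise, trading speculation depth against verification cost."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        accepted = speculative_step(out, ssms, verify, k)
        out.extend(accepted)
        k = min(k + 2, 16) if len(accepted) > k // 2 else max(k - 1, 1)
    return out
```

The third mechanism, decoupled and pipelined execution, would replace the sequential calls inside speculative_step with SSM drafting and LLM verification running concurrently (for example, producer/consumer queues), so the SSMs can draft the next round while the LLM is still verifying the current one.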

