S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models (2407.01955v1)

Published 2 Jul 2024 in cs.CL

Abstract: Deployment of autoregressive LLMs is costly, and as these models grow in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate token generation and reduce deployment costs. Speculative decoding (SD) is among the most promising approaches for speeding up LLM decoding: an auxiliary, smaller draft model generates candidate tokens, which the target model then verifies in parallel. In SD, one draft model usually serves a specific target model; in practice, however, LLMs are diverse, and we may need to deal with many target models, or with more than one target model simultaneously. In this scenario, it is not clear which draft model should be used for which target model, and searching among different draft models or training customized draft models can further increase deployment costs. In this paper, we first introduce a novel multi-target scenario for the deployment of draft models for faster inference. Then, we present a novel, more efficient sorted speculative decoding mechanism that outperforms regular baselines in multi-target settings. We evaluated our method on Spec-Bench in different settings, including base models such as Vicuna 7B, Vicuna 13B, and LLaMA Chat 70B. Our results suggest that our draft models perform better than baselines for multiple target models at the same time.
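
The draft-then-verify pattern the abstract describes can be made concrete with a short sketch. Below is a minimal greedy-verification speculative decoding loop in PyTorch, assuming HuggingFace-style causal LMs (models whose forward pass returns .logits of shape (batch, seq_len, vocab)). It illustrates the generic mechanism the paper builds on, not the paper's sorted multi-target method; the function name, the gamma parameter, and the model interface are assumptions for illustration, and KV caching plus the stochastic rejection-sampling acceptance rule used for sampled decoding are omitted for brevity.

import torch

@torch.no_grad()
def speculative_decode(target, draft, prefix, gamma=4, max_new_tokens=64):
    # prefix: LongTensor of shape (1, T) holding the prompt token ids.
    # Greedy-verification variant: the draft proposes gamma tokens; the
    # target scores them all in one forward pass and keeps the longest
    # prefix that matches its own greedy choices, plus one of its own.
    tokens = prefix
    end = prefix.shape[-1] + max_new_tokens
    while tokens.shape[-1] < end:
        # 1) Draft gamma candidate tokens autoregressively (cheap model).
        #    A real implementation would reuse a KV cache here instead of
        #    re-encoding the whole sequence each step.
        cand = tokens
        for _ in range(gamma):
            next_tok = draft(cand).logits[:, -1, :].argmax(-1, keepdim=True)
            cand = torch.cat([cand, next_tok], dim=-1)
        # 2) Verify all candidates with a single target forward pass.
        logits = target(cand).logits
        # Logits at index i predict token i+1, so slicing from T-1 gives
        # the target's greedy pick for each of the gamma drafted positions
        # plus one extra position (gamma+1 picks in total).
        picks = logits[:, tokens.shape[-1] - 1 :, :].argmax(-1)
        proposed = cand[:, tokens.shape[-1] :]
        # 3) Accept the longest agreeing prefix of the drafted tokens.
        n = 0
        while n < gamma and picks[0, n] == proposed[0, n]:
            n += 1
        # 4) Append accepted tokens plus one correction/bonus token from
        #    the target, so every iteration emits at least one token.
        tokens = torch.cat([tokens, proposed[:, :n], picks[:, n : n + 1]], dim=-1)
    return tokens[:, :end]

In the multi-target setting the paper introduces, the draft slot above would not be filled by a separately trained model per target; instead, as the title and abstract suggest, sub-models nested inside a single sorted network could serve several target models at once, avoiding the cost of searching for or training a customized draft per target.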

