Cascade Speculative Drafting for Even Faster LLM Inference (2312.11462v4)

Published 18 Dec 2023 in cs.LG and cs.CL

Abstract: Introduced to enhance the efficiency of LLM inference, speculative decoding operates by having a smaller model generate a draft. A larger target model then reviews this draft to align with its output, and any acceptance by the target model results in a reduction in the number of target model runs, ultimately improving efficiency. However, the drafting process in speculative decoding includes slow autoregressive generation and allocates equal time to generating tokens, irrespective of their importance. These inefficiencies collectively contribute to the suboptimal performance of speculative decoding. To further improve LLM inference, we introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal Cascade optimizes time allocation in drafting for improved efficiency. Combining both cascades, CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments, while maintaining the same output distribution as the target model. Our code is publicly available at https://github.com/lfsszd/CS-Drafting.
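
The authors' reference implementation is available at the GitHub link above. For orientation only, the following is a minimal Python sketch of the standard speculative-sampling accept/reject step together with a crude cascade-style allocation of draft lengths across drafters. The ToyLM class, its unigram toy distributions, and the simplified rejection handling are assumptions made here for a self-contained example; this is not the paper's CS Drafting algorithm.

```python
# Minimal, illustrative sketch only -- NOT the paper's CS Drafting implementation
# (see https://github.com/lfsszd/CS-Drafting for the authors' code).
# ToyLM and the simplified rejection handling are stand-ins assumed for this example.
import random

class ToyLM:
    """Toy 'language model' with a fixed unigram distribution over a tiny vocabulary."""
    def __init__(self, vocab, seed):
        self.vocab = vocab
        self.rng = random.Random(seed)
        weights = [self.rng.random() for _ in vocab]
        total = sum(weights)
        self.dist = {t: w / total for t, w in zip(vocab, weights)}

    def prob(self, prefix, token):
        # Context-independent in this toy; a real LM conditions on the prefix.
        return self.dist[token]

    def sample(self, prefix):
        r, acc = self.rng.random(), 0.0
        for t, p in self.dist.items():
            acc += p
            if r <= acc:
                return t
        return self.vocab[-1]

def speculative_step(target, drafter, prefix, k):
    """One speculative-decoding round: the drafter proposes up to k tokens and the
    target accepts each with probability min(1, p_target / p_draft). On rejection
    this sketch simply resamples from the target (the exact scheme resamples from
    the normalized residual distribution to preserve the target's outputs)."""
    out = list(prefix)
    for _ in range(k):
        tok = drafter.sample(out)
        if random.random() < min(1.0, target.prob(out, tok) / drafter.prob(out, tok)):
            out.append(tok)                      # draft token accepted
        else:
            out.append(target.sample(out))       # rejected: fall back to the target
            return out
    out.append(target.sample(out))               # bonus token after a fully accepted draft
    return out

def cascade_step(target, drafters, prefix, lengths):
    """Cascade-style allocation (sketch): stronger drafters handle the earliest
    positions, which are the most likely to be accepted, while cheaper drafters
    fill in the later, lower-value positions."""
    out = list(prefix)
    for drafter, k in zip(drafters, lengths):
        out = speculative_step(target, drafter, out, k)
    return out

if __name__ == "__main__":
    vocab = ["the", "cat", "sat", "on", "mat"]
    target = ToyLM(vocab, seed=0)
    drafters = [ToyLM(vocab, seed=1), ToyLM(vocab, seed=2)]  # strongest drafter first
    print(cascade_step(target, drafters, ["<bos>"], lengths=[3, 2]))
```

Relative to this sketch, the paper's Vertical Cascade removes slow autoregressive generation by neural draft models entirely, and its Horizontal Cascade tunes how drafting time is allocated across token positions, which together yield the reported additional speedup while keeping the target model's output distribution unchanged.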
