Recurrent Drafter for Fast Speculative Decoding in Large Language Models (2403.09919v5)
Abstract: We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that achieves state-of-the-art speedup for LLM inference. The performance gains are driven by three key aspects: (1) leveraging a recurrent neural network (RNN) as the draft model, conditioned on the LLM's hidden states; (2) applying a dynamic tree attention algorithm over beam search results to eliminate duplicated prefixes in candidate sequences; and (3) training through knowledge distillation from the LLM. ReDrafter accelerates Vicuna inference on MT-Bench by up to 2.8x with a PyTorch implementation on Nvidia H100 GPUs. To demonstrate its practicality in real environments, we also validated its effectiveness for on-device applications by implementing the approach in MLX and benchmarking performance on Metal GPUs in Apple Silicon chips, achieving up to a 2.3x speedup.
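To make the first ingredient concrete, here is a minimal PyTorch sketch of an RNN draft head conditioned on the target LLM's last hidden state. All module and parameter names (`RecurrentDrafter`, `draft`, `llm_hidden`) are illustrative assumptions, not the paper's implementation; a GRU cell stands in for the paper's recurrent dependency between draft steps, and greedy token selection stands in for the beam search the paper actually uses.

```python
import torch
import torch.nn as nn

class RecurrentDrafter(nn.Module):
    """Illustrative RNN draft head (hypothetical names, not the paper's code)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRUCell(hidden_size, hidden_size)   # recurrent update per draft step
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    @torch.no_grad()
    def draft(self, llm_hidden: torch.Tensor, last_token: torch.Tensor, steps: int):
        """Propose `steps` draft tokens greedily (the paper uses beam search)."""
        state = llm_hidden              # initialize recurrence from the LLM's hidden state
        tok, tokens = last_token, []
        for _ in range(steps):
            state = self.rnn(self.embed(tok), state)    # condition on previous draft token
            tok = self.lm_head(state).argmax(dim=-1)    # greedy pick, for brevity
            tokens.append(tok)
        return torch.stack(tokens, dim=-1)              # (batch, steps)
```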
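The second ingredient, deduplicating shared prefixes across beam-search candidates, can be sketched as building a token tree: each shared prefix becomes a single node that the LLM verifies only once. The helper below is a hypothetical illustration of that idea, not the paper's dynamic tree attention code; the returned parent links are what an ancestor-only attention mask would be derived from.

```python
def build_token_tree(candidates: list[list[int]]):
    """Collapse candidate sequences sharing prefixes into a flat token tree.

    Returns (flat_tokens, parent), where parent[i] is the index of token i's
    predecessor in the tree (-1 for roots).
    """
    index_of_prefix = {}        # maps a prefix tuple to its flat node index
    flat_tokens, parent = [], []
    for seq in candidates:
        prev = -1
        for t in range(len(seq)):
            prefix = tuple(seq[: t + 1])
            if prefix not in index_of_prefix:
                index_of_prefix[prefix] = len(flat_tokens)
                flat_tokens.append(seq[t])
                parent.append(prev)
            prev = index_of_prefix[prefix]
    return flat_tokens, parent

# Example: two beams sharing the prefix [5, 9] yield 4 tree nodes, not 6.
print(build_token_tree([[5, 9, 2], [5, 9, 7]]))
# -> ([5, 9, 2, 7], [-1, 0, 1, 1])
```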