Online Speculative Decoding (2310.07177v4)
Abstract: Speculative decoding is a pivotal technique to accelerate the inference of LLMs by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited by the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. Adapting to the query distribution mitigates the shift between the draft model's training distribution and the query distribution, enabling the draft model to predict the target model's outputs more accurately. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate, by 0.1 to 0.65, yielding a 1.42x to 2.17x latency reduction. Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.
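Below is a minimal sketch (not the released OSD implementation) of the workflow the abstract describes: serve queries with speculative decoding, buffer the target model's logits on observed queries, and periodically distill them into the draft model so it tracks the live query distribution. It assumes HuggingFace-style causal LMs that return `.logits`; the `propose`, `verify`, and `kd_update` helpers and all hyperparameters are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def propose(draft_model, input_ids, gamma):
    # Toy draft step: greedily propose `gamma` tokens with the draft model.
    ids = input_ids
    for _ in range(gamma):
        next_id = draft_model(ids).logits[:, -1:, :].argmax(dim=-1)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids[:, input_ids.shape[-1]:]

@torch.no_grad()
def verify(target_model, input_ids, draft_tokens):
    # One target forward pass over prompt + draft tokens. The logits serve
    # both the usual accept/reject check (omitted here) and as soft labels.
    full_ids = torch.cat([input_ids, draft_tokens], dim=-1)
    return full_ids, target_model(full_ids).logits

def kd_update(draft_model, optimizer, input_ids, target_logits, temperature=1.0):
    # One distillation step: pull the draft distribution toward the target's
    # (forward KL shown here; other KD objectives are possible).
    draft_logits = draft_model(input_ids).logits
    loss = F.kl_div(
        F.log_softmax(draft_logits / temperature, dim=-1),
        F.softmax(target_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def online_speculative_decoding(target_model, draft_model, optimizer,
                                query_stream, gamma=4, update_interval=8):
    # Serve queries with speculative decoding while periodically refreshing
    # the draft model on the observed query distribution.
    buffer = []
    for step, prompt_ids in enumerate(query_stream):
        draft_tokens = propose(draft_model, prompt_ids, gamma)
        full_ids, target_logits = verify(target_model, prompt_ids, draft_tokens)

        # Record (query, target logits) pairs as distillation data.
        buffer.append((full_ids, target_logits))

        # Every `update_interval` queries, distill the buffered data into the
        # draft model so it adapts to the current query distribution.
        if (step + 1) % update_interval == 0:
            for ids, logits in buffer:
                kd_update(draft_model, optimizer, ids, logits)
            buffer.clear()
```

Because the target logits are already produced during verification, reusing them as distillation labels (as the sketch does) adds little serving overhead beyond the periodic draft-model updates.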