Online Speculative Decoding (2310.07177v4)
Abstract: Speculative decoding is a pivotal technique to accelerate the inference of LLMs by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited by the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. Adapting to the query distribution mitigates the shift between the draft model's training distribution and the query distribution, enabling the draft model to predict the target model's outputs more accurately. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate, by 0.1 to 0.65, yielding a 1.42x to 2.17x latency reduction. Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.
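Below is a minimal sketch (not the released OSD implementation) of the workflow the abstract describes: serve queries with speculative decoding, buffer the target model's logits on observed queries, and periodically distill them into the draft model so it tracks the live query distribution. It assumes HuggingFace-style causal LMs that return `.logits`; the `propose`, `verify`, and `kd_update` helpers and all hyperparameters are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def propose(draft_model, input_ids, gamma):
    # Toy draft step: greedily propose `gamma` tokens with the draft model.
    ids = input_ids
    for _ in range(gamma):
        next_id = draft_model(ids).logits[:, -1:, :].argmax(dim=-1)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids[:, input_ids.shape[-1]:]

@torch.no_grad()
def verify(target_model, input_ids, draft_tokens):
    # One target forward pass over prompt + draft tokens. The logits serve
    # both the usual accept/reject check (omitted here) and as soft labels.
    full_ids = torch.cat([input_ids, draft_tokens], dim=-1)
    return full_ids, target_model(full_ids).logits

def kd_update(draft_model, optimizer, input_ids, target_logits, temperature=1.0):
    # One distillation step: pull the draft distribution toward the target's
    # (forward KL shown here; other KD objectives are possible).
    draft_logits = draft_model(input_ids).logits
    loss = F.kl_div(
        F.log_softmax(draft_logits / temperature, dim=-1),
        F.softmax(target_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def online_speculative_decoding(target_model, draft_model, optimizer,
                                query_stream, gamma=4, update_interval=8):
    # Serve queries with speculative decoding while periodically refreshing
    # the draft model on the observed query distribution.
    buffer = []
    for step, prompt_ids in enumerate(query_stream):
        draft_tokens = propose(draft_model, prompt_ids, gamma)
        full_ids, target_logits = verify(target_model, prompt_ids, draft_tokens)

        # Record (query, target logits) pairs as distillation data.
        buffer.append((full_ids, target_logits))

        # Every `update_interval` queries, distill the buffered data into the
        # draft model so it adapts to the current query distribution.
        if (step + 1) % update_interval == 0:
            for ids, logits in buffer:
                kd_update(draft_model, optimizer, ids, logits)
            buffer.clear()
```

Because the target logits are already produced during verification, reusing them as distillation labels (as the sketch does) adds little serving overhead beyond the periodic draft-model updates.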