
Abstract

Drafting-then-verifying decoding methods such as speculative decoding are widely adopted, training-free methods for accelerating LLM inference. Instead of decoding tokens sequentially in an autoregressive process, speculative decoding first generates drafts with an efficient small model; the LLM then verifies and corrects the drafts in a non-autoregressive fashion to minimize time overhead. Longer drafts yield greater speedups once verified, but also incur substantial trial-and-error costs when verification fails. Because of the high probability of verification failure, existing decoding methods cannot draft much content at a time, achieving sub-optimal inference acceleration. In this paper, we introduce Ouroboros, which constructs a phrase candidate pool from the LLM's verification process to supply candidates for the small model's draft generation, thereby improving both the efficiency and the effectiveness of the initial drafts. Experimental results on typical text generation tasks show that Ouroboros achieves speedups of up to 1.9x and 2.8x compared to lookahead decoding and speculative decoding, respectively. The source code of Ouroboros is available at https://github.com/thunlp/Ouroboros.

[Figure] Comparison of Ouroboros, Lookahead, Speculative, and Greedy decoding methods, illustrated with color-coded word generation.

Overview

  • Ouroboros introduces a drafting-then-verifying decoding framework, in the tradition of speculative decoding, that accelerates LLM inference without additional training.

  • The framework features a shared candidate pool that improves the quality and length of drafts generated by smaller models before verification by LLMs, addressing the short-draft limitation of previous methods.

  • Empirical tests on tasks such as code generation and machine translation show Ouroboros achieving inference speedups of up to 1.9× over lookahead decoding and 2.8× over speculative decoding while maintaining task performance.

  • Ouroboros signifies a step towards optimizing LLM efficiency for real-time applications, with potential for future research on model interactions and decoding strategies.

Enhancing Inference Acceleration in LLMs with Ouroboros: A Speculative Decoding Framework

Introduction

Recent advancements in LLMs have set remarkable benchmarks across natural language processing tasks, yet the demand for efficient inference in real-time applications remains a significant challenge. The inefficiency stems from the autoregressive decoding mechanism prevalent in LLMs, which decodes tokens sequentially, limiting parallelization and incurring extensive computational overhead. To address this, the paper introduces Ouroboros, a decoding framework that substantially strengthens the initial drafting phase and uses verification errors constructively, enabling faster and more efficient inference for LLMs without compromising task performance.
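To make the bottleneck concrete, the minimal sketch below shows why plain autoregressive greedy decoding is inherently sequential: every generated token requires its own full forward pass of the model, so the passes cannot be parallelized across time steps. The `model` callable is an illustrative stand-in, not part of the Ouroboros codebase.

```python
from typing import Callable, List

def greedy_decode(model: Callable[[List[int]], int],
                  prompt: List[int],
                  max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding: one LLM forward pass per new token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):   # T new tokens -> T sequential model calls
        next_token = model(tokens)    # each call depends on the previous result
        tokens.append(next_token)
    return tokens
```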

Speculative Decoding Framework

Ouroboros operates on a drafting-then-verifying decoding principle: it generates initial drafts with a smaller model and then uses an LLM for verification. Uniquely, Ouroboros maintains a phrase candidate pool that feeds verification outcomes back into the drafting phase, producing longer and more accurate drafts. This iterative refinement not only improves inference speed but also preserves the quality of generated content, tackling two fundamental limitations of existing drafting-then-verifying methods: insufficient draft lengths and underutilized verification results.
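As a rough illustration, one drafting-then-verifying iteration might look like the sketch below. The `small_model.draft` and `llm.verify` interfaces are hypothetical stand-ins for the actual implementation in the Ouroboros repository; in particular, `verify` is assumed to score the whole draft in a single parallel forward pass and return the accepted prefix plus one corrected token.

```python
def draft_then_verify_step(llm, small_model, tokens, draft_len=8):
    """One iteration of drafting-then-verifying decoding (illustrative)."""
    # 1) Cheap: the small model drafts `draft_len` tokens autoregressively.
    draft = small_model.draft(tokens, num_tokens=draft_len)

    # 2) Expensive but parallel: the LLM checks the entire draft in one
    #    forward pass, keeping the longest prefix that matches its own
    #    greedy choices and emitting one corrected token at the mismatch.
    accepted, correction = llm.verify(tokens, draft)

    # 3) The rejected suffix is not wasted: Ouroboros mines it for phrase
    #    candidates that seed the next round of drafting.
    return tokens + accepted + [correction]
```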

Framework Components and Mechanisms

The Ouroboros methodology extends conventional speculative decoding with several pivotal features, sketched in code after this list:

  • Shared Candidate Pool: It fosters a well-integrated interaction between the drafting and verifying phases. By utilizing a phrase candidate pool for drafting, Ouroboros enhances both the length and quality of initial drafts, leading to accelerated inference times.
  • Utilization of Verification Results: Instead of discarding tokens after a verification failure, Ouroboros mines them for new phrase candidates, leveraging every verification output to refine subsequent drafts.
  • Warm Start Capability: Addressing the issue of cold starts, Ouroboros implements a pre-filled candidate pool based on similar tasks, further enhancing generation speeds through context locality.
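A toy version of such a shared candidate pool, under the simplifying assumption that fixed-length phrases are indexed by their single preceding token, is sketched below; the actual Ouroboros pool uses its own keying and update rules.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class CandidatePool:
    """Toy phrase candidate pool: preceding token -> candidate phrases."""

    def __init__(self, phrase_len: int = 4):
        self.phrase_len = phrase_len
        self.pool: Dict[int, List[Tuple[int, ...]]] = defaultdict(list)

    def update(self, tokens: List[int]) -> None:
        # Harvest phrases from LLM verification output, including tokens
        # that failed verification, instead of discarding them.
        for i in range(len(tokens) - self.phrase_len):
            phrase = tuple(tokens[i + 1 : i + 1 + self.phrase_len])
            if phrase not in self.pool[tokens[i]]:
                self.pool[tokens[i]].append(phrase)

    def lookup(self, last_token: int) -> List[Tuple[int, ...]]:
        # Candidate continuations the small model can splice into its draft.
        return self.pool.get(last_token, [])

# Warm start: prefill the pool with token sequences from a similar past
# task so early iterations already exploit context locality.
pool = CandidatePool()
pool.update([3, 14, 15, 92, 65, 35, 89, 79, 32, 38])  # illustrative token ids
print(pool.lookup(3))  # -> [(14, 15, 92, 65)]
```

The notable design choice is the feedback loop: the small model reads from the pool while drafting, and the LLM's verification step writes back into it, closing the cycle that gives the framework its name.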

Empirical Validation

Across a spectrum of text generation tasks, including code generation and machine translation, Ouroboros demonstrates substantial inference acceleration, achieving speedups of up to 1.9× over lookahead decoding and 2.8× over speculative decoding. Moreover, Ouroboros is lossless with respect to task performance, preserving the output quality of the underlying LLM.

Implications and Future Directions

Ouroboros marks a promising step toward reconciling the need for real-time responsiveness with the computational demands of LLMs. The framework opens avenues for further research into optimizing the interaction between larger and smaller models in generative tasks and into the limits of efficiency and quality in drafting and verification. Additionally, while the current implementation targets greedy decoding, extending Ouroboros to support sampling-based decoding strategies is a natural direction for future work.

Conclusion

Ouroboros is a notable advance in LLM inference acceleration, addressing the dual challenges of inefficiency and quality compromise. Through its shared candidate pool and its constructive use of verification results, it demonstrates how much headroom remains in speculative decoding methodologies. As the field continues to evolve, such advances bring efficient real-world applications of LLMs steadily closer.
