Emergent Mind

Optimal Block-Level Draft Verification for Accelerating Speculative Decoding

(2403.10444)
Published Mar 15, 2024 in cs.LG, cs.CL, cs.DS, cs.IT, and math.IT

Abstract

Speculative decoding has been shown to be an effective method for lossless acceleration of LLMs during inference. In each iteration, the algorithm first uses a smaller model to draft a block of tokens. The tokens are then verified by the large model in parallel, and only a subset of tokens is kept to guarantee that the final output follows the distribution of the large model. In all prior speculative decoding works, draft verification is performed token-by-token, independently. In this work, we propose a better draft verification algorithm that provides additional wall-clock speedup without incurring additional computation cost or extra draft tokens. We first formulate the draft verification step as a block-level optimal transport problem. The block-level formulation allows us to consider a wider range of draft verification algorithms and obtain a higher expected number of accepted tokens in one draft block. We propose a verification algorithm that achieves the optimal accepted length for the block-level transport problem. We empirically evaluate our proposed block-level verification algorithm on a wide range of tasks and datasets, and observe consistent improvements in wall-clock speedup compared to the token-level verification algorithm. To the best of our knowledge, our work is the first to establish an improvement over speculative decoding through a better draft verification algorithm.

Figure: Comparison of accepted tokens under token-level vs. block-level verification methods.

Overview

  • Speculative decoding is enhanced by introducing a block-level draft verification method, termed Specblock, which formulates verification as an optimal transport problem to improve efficiency without additional computational cost.

  • Specblock consistently outperforms the traditional token-level verification method, Spectoken, achieving higher block efficiency and faster speculative decoding across various tasks and datasets.

  • The core advantage of Specblock lies in computing the optimal acceptance length via a backward-induction process, keeping the output aligned with the large model's distribution without extra model calls.

  • Experimental evidence and theoretical analysis confirm Specblock's superiority in achieving faster speculative decoding, pointing towards its potential in making LLMs more accessible.

Improving Speculative Decoding with Block-Level Draft Verification

Introduction

Speculative decoding has become a prominent approach for accelerating the inference of LLMs by drafting blocks of tokens using a smaller model and verifying these tokens with the larger target model in parallel. However, the prevalent draft verification algorithm, Spectoken, verifies tokens independently, which may not be optimal. In this work, we present a novel formulation of the draft verification step as a block-level optimal transport problem, leading to a more efficient draft verification algorithm that enhances the speedup of speculative decoding without incurring extra computational costs.
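As background, the standard token-by-token accept/reject rule can be sketched as follows. The function name and array interface here are our own, and the sketch is simplified: it omits the bonus token that is normally sampled from the large model when every draft token is accepted.

```python
import numpy as np

def token_level_verify(draft_tokens, q_probs, p_probs, rng):
    """Token-by-token draft verification (Spectoken-style sketch).

    draft_tokens: token ids sampled from the small model.
    q_probs[i][t]: small-model probability of token t at position i.
    p_probs[i][t]: large-model probability of token t at position i.
    Returns the accepted prefix, plus one corrected token on rejection.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i][tok], q_probs[i][tok]
        # Accept with probability min(1, p/q); this rule preserves the
        # large-model distribution exactly at each position.
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual distribution
            # max(p - q, 0), renormalized, and stop.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```

Because each position is decided independently, a single rejection discards every remaining draft token; the block-level view below removes exactly this inefficiency.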

The Optimal Transport Problem Formulation

Our approach hinges on formulating the draft verification process as an optimal transport problem at the block level, with the aim of maximizing the expected number of accepted tokens in one draft block, a quantity directly tied to decoding speedup. We propose a verification algorithm that guarantees the optimal acceptance length for this block-level transport problem, yielding a clear improvement over previously used token-level verification methods.
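One natural way to write this objective, in notation of our own choosing, is as a search over couplings of the two block distributions:

```latex
Let $q$ and $p$ denote the draft and target distributions over token
blocks of length $L$, and for blocks $x, y$ let
$\ell(x, y) = \max\{\, i : x_{1:i} = y_{1:i} \,\}$
be the length of their longest common prefix. The block-level problem is
\[
  \max_{\pi \in \Pi(q, p)} \;
  \mathbb{E}_{(X, Y) \sim \pi}\big[\, \ell(X, Y) \,\big],
\]
where $\Pi(q, p)$ is the set of couplings with marginals $q$ and $p$,
so the drafted block satisfies $X \sim q$ and the verified output
satisfies $Y \sim p$.
```

The marginal constraint on $Y$ is what makes the acceleration lossless: any feasible coupling keeps the output distributed exactly as the large model, and the objective simply asks for the coupling that reuses the longest draft prefix in expectation.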

The Proposed Algorithm: Specblock

Our proposed algorithm, Specblock, efficiently computes the acceptance length without requiring additional calls to either the small or the large model. It first attempts a maximal coupling on the entire block. If that attempt is rejected, it runs a backward induction, deciding on partial acceptance based on the remaining and rejected probability masses for continuations of the draft. This backward induction ensures that the accepted tokens and subsequent corrections match the distribution defined by the large model while efficiently computing the conditional distributions for corrected tokens.
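The first step, attempting a maximal coupling on the whole block, can be sketched as follows. The function name and interface are our own simplification, and the backward-induction fallback from the paper is only stubbed out.

```python
import math
import numpy as np

def attempt_block_acceptance(draft_tokens, q_probs, p_probs, rng):
    """Sketch of Specblock's first step: a maximal coupling on the
    entire drafted block (backward-induction fallback not shown).

    q_probs[i][t], p_probs[i][t]: per-position conditional probabilities
    of token t under the small and large models. By the chain rule, the
    block probabilities are products of these per-token terms.
    """
    log_q = sum(math.log(q_probs[i][t]) for i, t in enumerate(draft_tokens))
    log_p = sum(math.log(p_probs[i][t]) for i, t in enumerate(draft_tokens))
    # Accept the whole block with probability min(1, p(block)/q(block)),
    # computed in log space for numerical stability.
    accept_prob = math.exp(min(0.0, log_p - log_q))
    if rng.random() < accept_prob:
        return list(draft_tokens)
    # On rejection, the backward induction over suffixes would decide
    # the longest acceptable prefix and the correction distribution.
    return None
```

The same uniform random draw that rejects the full block can be reused by the backward induction, which is why the method needs no extra model calls beyond those the draft and verification passes already make.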

Experimental Validation

We compared Specblock against the standard Spectoken algorithm across various datasets and tasks, including language modeling, reasoning queries, summarization, and translation. Our experiments reveal consistent improvements in both block efficiency and wall-clock speedup with Specblock. In particular, we observed larger improvements for larger block lengths, showcasing the scalability of our approach.

Theoretical Justification

A formal analysis shows that Specblock is optimal for the formulated block-level optimal transport problem: it attains the maximum expected accepted length, providing a theoretical underpinning for the empirical improvements observed.

Future Directions

Our work opens several avenues for future exploration. Notably, the combination of optimizing the drafting phase with our improved draft verification algorithm presents a promising direction for further enhancing speculative decoding's efficiency. Additionally, exploring the implications of block-level verification beyond speculative decoding in the broader context of accelerating LLMs warrants attention.

Conclusion

Specblock represents a significant advancement in the pursuit of efficient speculative decoding by optimizing the draft verification phase through block-level verification. This approach not only achieves theoretical optimality but also demonstrates practical improvements in speedup across a spectrum of tasks and datasets. As LLMs continue to grow in size and computational demand, innovations like Specblock will be vital in making these models more accessible and practical for a broader range of applications.
