Emergent Mind

Optimal Block-Level Draft Verification for Accelerating Speculative Decoding

(2403.10444)
Published Mar 15, 2024 in cs.LG, cs.CL, cs.DS, cs.IT, and math.IT

Abstract

Speculative decoding has been shown to be an effective method for lossless acceleration of LLMs during inference. In each iteration, the algorithm first uses a smaller model to draft a block of tokens. The tokens are then verified by the large model in parallel, and only a subset of tokens is kept to guarantee that the final output follows the distribution of the large model. In all prior speculative decoding works, draft verification is performed token-by-token, independently. In this work, we propose a better draft verification algorithm that provides additional wall-clock speedup without incurring additional computation cost or extra draft tokens. We first formulate the draft verification step as a block-level optimal transport problem. The block-level formulation allows us to consider a wider range of draft verification algorithms and obtain a higher expected number of accepted tokens in one draft block. We propose a verification algorithm that achieves the optimal accepted length for the block-level transport problem. We empirically evaluate our proposed block-level verification algorithm on a wide range of tasks and datasets, and observe consistent improvements in wall-clock speedup compared to the token-level verification algorithm. To the best of our knowledge, our work is the first to establish an improvement over speculative decoding through a better draft verification algorithm.

Figure: Comparison of accepted tokens under token-level vs. block-level verification methods.

Overview

  • Speculative decoding is enhanced by introducing a block-level draft verification method, termed Specblock, which formulates verification as an optimal transport problem to improve efficiency without additional computational cost.

  • Specblock consistently outperforms the traditional token-level verification method, Spectoken, achieving higher block efficiency and faster speculative decoding across various tasks and datasets.

  • The core advantage of Specblock lies in computing the optimal acceptance length via a backward-induction process, keeping the output aligned with the large model's distribution without extra model calls.

  • Experimental evidence and theoretical analysis confirm Specblock's superiority in achieving faster speculative decoding, pointing towards its potential in making LLMs more accessible.

Improving Speculative Decoding with Block-Level Draft Verification

Introduction

Speculative decoding has become a prominent approach for accelerating the inference of LLMs by drafting blocks of tokens using a smaller model and verifying these tokens with the larger target model in parallel. However, the prevalent draft verification algorithm, Spectoken, verifies tokens independently, which may not be optimal. In this work, we present a novel formulation of the draft verification step as a block-level optimal transport problem, leading to a more efficient draft verification algorithm that enhances the speedup of speculative decoding without incurring extra computational costs.
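As background, the standard token-by-token accept/reject rule can be sketched as follows. The function name and array interface here are our own, and the sketch is simplified: it omits the bonus token that is normally sampled from the large model when every draft token is accepted.

```python
import numpy as np

def token_level_verify(draft_tokens, q_probs, p_probs, rng):
    """Token-by-token draft verification (Spectoken-style sketch).

    draft_tokens: token ids sampled from the small model.
    q_probs[i][t]: small-model probability of token t at position i.
    p_probs[i][t]: large-model probability of token t at position i.
    Returns the accepted prefix, plus one corrected token on rejection.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i][tok], q_probs[i][tok]
        # Accept with probability min(1, p/q); this rule preserves the
        # large-model distribution exactly at each position.
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual distribution
            # max(p - q, 0), renormalized, and stop.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```

Because each position is decided independently, a single rejection discards every remaining draft token; the block-level view below removes exactly this inefficiency.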

The Optimal Transport Problem Formulation

Our approach hinges on formulating the draft verification process as an optimal transport problem at the block level, with the aim of maximizing the expected number of accepted tokens in one draft block, a quantity directly tied to decoding speedup. We propose a verification algorithm that guarantees the optimal acceptance length for this block-level transport problem, yielding a clear improvement over previously used token-level verification methods.
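One natural way to write this objective, in notation of our own choosing, is as a search over couplings of the two block distributions:

```latex
Let $q$ and $p$ denote the draft and target distributions over token
blocks of length $L$, and for blocks $x, y$ let
$\ell(x, y) = \max\{\, i : x_{1:i} = y_{1:i} \,\}$
be the length of their longest common prefix. The block-level problem is
\[
  \max_{\pi \in \Pi(q, p)} \;
  \mathbb{E}_{(X, Y) \sim \pi}\big[\, \ell(X, Y) \,\big],
\]
where $\Pi(q, p)$ is the set of couplings with marginals $q$ and $p$,
so the drafted block satisfies $X \sim q$ and the verified output
satisfies $Y \sim p$.
```

The marginal constraint on $Y$ is what makes the acceleration lossless: any feasible coupling keeps the output distributed exactly as the large model, and the objective simply asks for the coupling that reuses the longest draft prefix in expectation.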

The Proposed Algorithm: Specblock

Our proposed algorithm, Specblock, efficiently computes the acceptance length without requiring additional calls to either the small or the large model. It first attempts a maximal coupling on the entire block. If that attempt is rejected, it runs a backward induction, deciding on partial acceptance based on the remaining and rejected probability masses for continuations of the draft. This backward induction ensures that the accepted tokens and subsequent corrections match the distribution defined by the large model while efficiently computing the conditional distributions for corrected tokens.
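The first step, attempting a maximal coupling on the whole block, can be sketched as follows. The function name and interface are our own simplification, and the backward-induction fallback from the paper is only stubbed out.

```python
import math
import numpy as np

def attempt_block_acceptance(draft_tokens, q_probs, p_probs, rng):
    """Sketch of Specblock's first step: a maximal coupling on the
    entire drafted block (backward-induction fallback not shown).

    q_probs[i][t], p_probs[i][t]: per-position conditional probabilities
    of token t under the small and large models. By the chain rule, the
    block probabilities are products of these per-token terms.
    """
    log_q = sum(math.log(q_probs[i][t]) for i, t in enumerate(draft_tokens))
    log_p = sum(math.log(p_probs[i][t]) for i, t in enumerate(draft_tokens))
    # Accept the whole block with probability min(1, p(block)/q(block)),
    # computed in log space for numerical stability.
    accept_prob = math.exp(min(0.0, log_p - log_q))
    if rng.random() < accept_prob:
        return list(draft_tokens)
    # On rejection, the backward induction over suffixes would decide
    # the longest acceptable prefix and the correction distribution.
    return None
```

The same uniform random draw that rejects the full block can be reused by the backward induction, which is why the method needs no extra model calls beyond those the draft and verification passes already make.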

Experimental Validation

We compared Specblock against the standard Spectoken algorithm across various datasets and tasks, including language modeling, reasoning queries, summarization, and translation. Our experiments reveal consistent improvements in both block efficiency and wall-clock speedup with Specblock. In particular, we observed larger improvements for larger block lengths, showcasing the scalability of our approach.

Theoretical Justification

A formal analysis shows that Specblock is optimal for the formulated block-level optimal transport problem: it attains the maximum expected accepted length, providing a theoretical underpinning for the empirical improvements observed.

Future Directions

Our work opens several avenues for future exploration. Notably, the combination of optimizing the drafting phase with our improved draft verification algorithm presents a promising direction for further enhancing speculative decoding's efficiency. Additionally, exploring the implications of block-level verification beyond speculative decoding in the broader context of accelerating LLMs warrants attention.

Conclusion

Specblock represents a significant advancement in the pursuit of efficient speculative decoding by optimizing the draft verification phase through block-level verification. This approach not only achieves theoretical optimality but also demonstrates practical improvements in speedup across a spectrum of tasks and datasets. As LLMs continue to grow in size and computational demand, innovations like Specblock will be vital in making these models more accessible and practical for a broader range of applications.
