
Multi-Candidate Speculative Decoding (2401.06706v1)

Published 12 Jan 2024 in cs.CL

Abstract: LLMs have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments (a sequence of tokens) from a fast draft model that is then verified in parallel by the target model. However, the acceptance rate of candidate tokens receives limitations from several factors, such as the model, the dataset, and the decoding setup. This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification. We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model. Our approach shows significant improvements in acceptance rates on multiple datasets and models, consistently outperforming standard speculative decoding.


Summary

  • The paper presents multi-candidate speculative decoding to significantly improve acceptance rates and reduce inference latency compared to standard methods.
  • It employs multi-candidate sampling and a novel tree attention mechanism for efficient batched verification while ensuring target model output fidelity.
  • Empirical evaluations across various LLMs and datasets demonstrate consistent wall-clock speedups, often exceeding 1.5–2x, even under fine-tuned or out-of-distribution settings.

Multi-Candidate Speculative Decoding: An Expert Overview

The paper "Multi-Candidate Speculative Decoding" (2401.06706) addresses the bottleneck of high latency in autoregressive LLM generation by extending speculative decoding to a multi-candidate regime. The authors propose algorithms and architectural modifications that systematically improve acceptance rates and wall-clock efficiency over the standard speculative decoding paradigm, while retaining the output distribution fidelity of the target model.

Motivation and Background

Speculative Decoding (SD) leverages a fast, low-cost draft model to propose sequences, which are then verified or rejected by the more expensive target LLM. The inference speedup depends crucially on the acceptance rate—the likelihood that the target model agrees with the draft proposal at each token. SD’s efficiency is hindered when the candidate acceptance rate is low due to distributional mismatch, longer prompts, or model fine-tuning. Notably, fine-tuning the target but not the draft can drastically worsen acceptance rates, particularly on datasets divergent from the draft model's training distribution.
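
For concreteness, here is a minimal sketch of the single-candidate accept/reject rule that standard SD applies at each position. This is illustrative, not the paper's code: p and q stand for the target and draft token distributions at one position, and the function name is ours.

```python
import numpy as np

def speculative_accept(x: int, p: np.ndarray, q: np.ndarray,
                       rng: np.random.Generator) -> int:
    """Verify one draft token x (sampled from q) against the target distribution p.

    Accept x with probability min(1, p[x] / q[x]); on rejection, resample from the
    normalized residual (p - q)^+ so the returned token is distributed exactly as p.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # draft token accepted
    residual = np.maximum(p - q, 0.0)
    return int(rng.choice(len(p), p=residual / residual.sum()))
```

The acceptance rate discussed above is the probability that the accept branch fires, averaged over positions and prompts.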

Methodological Contribution

Multi-Candidate Sampling and Verification:

The paper introduces a strategy in which, at each decoding step, the draft model samples k candidate tokens. These candidates are then verified in a single batch by the target model, exploiting parallel hardware to increase throughput. Two key algorithmic advances underpin this approach:

  • Multi-Candidate Speculative Sampling (MCSS): MCSS generalizes the standard SD algorithm to k candidates per position, checking acceptances sequentially and renormalizing the residual probability mass after each rejection so that accepted outputs remain distributed identically to the target model. The algorithm is proven (Appendices A/B) to preserve the correct distribution, both with and without replacement sampling from the draft model; a minimal verification sketch follows this list.

  • Tree Attention Mechanism: Batching candidates duplicates keys and values across candidate sequences, and this cache redundancy can negate the batching speedup. The authors adapt the Tree Attention concept, arranging candidate verifications as branches of a tree and using an attention mask to prevent branches from attending to one another. This lets all candidate continuations share the prefix state, minimizing memory copies and communication overhead, which is critical for high-throughput inference.
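
As referenced in the MCSS item above, below is a minimal sketch of multi-candidate verification at a single position, assuming the k candidates are drawn i.i.d. from the draft distribution (the with-replacement case). The without-replacement variant and the batched, tree-structured implementation described in the paper differ in detail, so treat this as an illustration rather than the authors' exact algorithm.

```python
import numpy as np

def mcss_verify(candidates: list[int], p: np.ndarray, q: np.ndarray,
                rng: np.random.Generator) -> int:
    """Sequentially verify k i.i.d. draft candidates against the target distribution p.

    Each rejection moves the remaining mass to the residual (target - q)^+ and
    renormalizes, so the returned token is still distributed exactly as p.
    """
    target = p.astype(float)
    for x in candidates:
        if rng.random() < min(1.0, target[x] / q[x]):
            return x                              # one of the k candidates accepted
        target = np.maximum(target - q, 0.0)      # remove the draft's probability mass
        total = target.sum()
        if total <= 0.0:                          # numerically possible only if target ~ q
            target = q.astype(float)
        else:
            target = target / total
    # All k candidates rejected: fall back to sampling the final residual.
    return int(rng.choice(len(p), p=target))
```

With k = 1 this reduces to the single-candidate rule sketched earlier; larger k gives more chances to accept a draft token before falling back to the residual sample.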

Empirical Evaluation

Acceptance Rate and Efficiency Gains

Across multiple LLM architectures (LLaMA, Vicuna, LLaMA2, OPT), datasets (Alpaca, WMT EnDe), and draft–target pairings, the multi-candidate approach yields substantial improvements:

  • Acceptance rate increases with k: for k = 4, acceptance rates improved, for example, from 0.76 to 0.88 (LLaMA-13B on Alpaca) and from 0.49 to 0.67 (Vicuna-13B on Alpaca).
  • Speedup in wall-clock time: The proposed MCSD method outperforms standard SD, with wall-clock speedups commonly exceeding 1.5–2x depending on model size and dataset, even under fine-tuned or out-of-distribution (WMT) settings.
  • Block Efficiency: Increasing k brings diminishing returns due to overheads and only marginal further improvements in acceptance. The empirical data suggest the optimal k is task- and hardware-dependent but modest (e.g., k = 4 or k = 8) for most regimes (a back-of-the-envelope illustration follows this list).
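
The following back-of-the-envelope illustration shows why returns diminish. It is our simplification, not the paper's analysis: it treats per-token acceptance as i.i.d. and approximates k-candidate acceptance as 1 - (1 - α)^k, which overstates the gain (candidates drawn from the same draft distribution overlap) and ignores drafting and batching overheads.

```python
def tokens_per_step(accept: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass when each drafted token is
    accepted independently with probability `accept` (standard SD-style estimate)."""
    return (1 - accept ** (draft_len + 1)) / (1 - accept)

alpha = 0.5  # illustrative single-candidate acceptance rate, not a reported figure
for k in (1, 2, 4, 8):
    beta = 1 - (1 - alpha) ** k  # crude k-candidate acceptance under independence
    print(f"k={k}: acceptance ~ {beta:.2f}, tokens/step ~ {tokens_per_step(beta, 4):.2f}")
```

Most of the improvement comes from the first few candidates; beyond roughly k = 4 the estimate is already close to its ceiling of draft_len + 1 tokens per step.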

Architectural and Algorithmic Insights

  • Budget Configurations: Monotonically decreasing k configurations (allocating more candidates to earlier positions in the drafted segment) yield superior efficiency, since a later position can only be accepted if every earlier position was accepted, so early candidates matter more.

  • Tree Attention: Ablations show Tree Attention has an outsized impact on practical speed, curbing the otherwise prohibitive KV-cache replication and communication costs of naive batching (a sketch of such a mask follows this list).

  • Generality and Extensibility: Empirical results demonstrate robust gains across a variety of target LLMs, including fine-tuned, base, and externally trained models, as well as when stacked with other acceptance-rate-improving methods (e.g., draft-model fine-tuning).
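
As referenced in the Tree Attention item above, here is a minimal sketch of a tree-style attention mask for a shared prefix followed by k candidate branches of equal depth. The layout and names are illustrative, not the authors' implementation.

```python
import numpy as np

def tree_attention_mask(prefix_len: int, k: int, depth: int) -> np.ndarray:
    """Boolean mask for a sequence laid out as
    [shared prefix | branch_0 | branch_1 | ... | branch_{k-1}].

    mask[i, j] is True iff query position i may attend to key position j:
    branch tokens see the shared prefix and their own branch causally,
    but never tokens from sibling branches.
    """
    total = prefix_len + k * depth
    mask = np.zeros((total, total), dtype=bool)

    # Shared prefix: ordinary causal attention.
    for i in range(prefix_len):
        mask[i, : i + 1] = True

    # Each candidate branch: attend to the prefix plus itself, causally.
    for b in range(k):
        start = prefix_len + b * depth
        for t in range(depth):
            i = start + t
            mask[i, :prefix_len] = True      # shared prefix (and its KV cache)
            mask[i, start : i + 1] = True    # causal within this branch only
    return mask

# Example: 3 prefix tokens and 2 candidate branches of depth 2 -> a 7x7 mask.
print(tree_attention_mask(prefix_len=3, k=2, depth=2).astype(int))
```

Because every branch row reuses the same prefix columns, the prefix KV cache is stored once and shared, which is the source of the memory and communication savings described above.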

Implications and Future Directions

Practical Deployment:

The presented methods are compatible with prevalent LLM serving frameworks and offer immediate efficiency gains for production-grade text generation services constrained by hardware or budget. The reliance on small draft models and batched verifications integrates well with GPU-based and distributed inference infrastructures.

Limitations and Scaling:

While the gains are robust, acceleration is bounded by the diminishing returns in acceptance as k increases and by additional compute/memory costs in draft model invocations or candidate batching. Scenarios with highly misaligned draft/target pairs (due to domain shift or heavy fine-tuning) still suffer diminished speedups, albeit less than standard SD.

Broader AI Impact and Future Work:

This work points towards a rich space for further LLM inference acceleration:

  • Combining MCSD with online draft model adaptation or knowledge distillation to drive acceptance rates even higher dynamically.
  • More general batched speculative decoding for structured or multi-modal outputs.
  • Extending Tree Attention to distributed/shared memory multigeneration setups at scale.
  • Integrating candidate pruning or probabilistic routing to further optimize draft–target allocation.

Conclusion

Multi-Candidate Speculative Decoding provides a principled, empirically validated, and practically deployable means for improving autoregressive LLM inference efficiency. By leveraging parallel candidate verification and architectural innovations such as Tree Attention, it sets a foundation for further acceleration and efficiency research in generative model serving, and is expected to become integral in high-performance LLM deployment pipelines.
