Recurrent Drafter for Fast Speculative Decoding in Large Language Models

(arXiv:2403.09919)
Published Mar 14, 2024 in cs.CL and cs.LG

Abstract

In this paper, we introduce an improved approach to speculative decoding aimed at enhancing the efficiency of serving LLMs. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, our method distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small draft model used in classic speculative decoding, but without the complexities of the full transformer architecture. Because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The outcome is a method that retains the simplicity of the single-model design while avoiding the need to create a data-dependent tree attention structure solely for inference, as Medusa does. We empirically demonstrate the effectiveness of the proposed method on several popular open-source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.

Figure: Medusa and the recurrent drafter compared; the latter shows higher predictive accuracy with lower memory usage.

Overview

  • ReDrafter introduces a speculative decoding approach that uses a recurrent dependency in its draft head to speed up LLM inference.

  • The model simplifies speculative decoding by using a single drafting head for predicting multiple candidate tokens, reducing complexity and parameter count.

  • ReDrafter employs beam search and a novel dynamic tree attention mechanism to optimize candidate sequence generation, improving prediction quality and computational efficiency.

  • Experimental results demonstrate ReDrafter's superior performance in both speed and accuracy compared to existing speculative decoding methods, marking a significant advancement in LLM efficiency.

Recurrent Drafter for Fast Speculative Decoding in LLMs

Introduction

Recent advancements in LLMs have sparked interest in enhancing their efficiency, particularly during inference. Speculative decoding has emerged as a promising strategy to accelerate LLM inference by using a smaller draft model to propose preliminary candidate tokens that the target model then verifies. This paper introduces the Recurrent Drafter (ReDrafter), an approach that combines the strengths of classic two-model speculative decoding and the single-model Medusa approach. Unlike existing methods that require either multiple models or complex tree dependencies, ReDrafter employs a single, lightweight draft head with a recurrent dependency, enabling faster and more efficient speculative decoding.

Proposed Method

The core innovation of ReDrafter lies in its drafting strategy, which merges insights from RNN language models with speculative decoding. The method uses a single set of parameters for the draft head, reused at every draft step, allowing it to predict multiple tokens in sequence while accounting for the dependencies between them, thereby reducing the complexity traditionally associated with speculative decoding models.
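
A minimal sketch in PyTorch of how such a shared-parameter draft head could look. The class name, the SiLU activation, and the greedy loop are illustrative assumptions rather than the paper's exact implementation, but the structure matches the description above: one set of weights applied recurrently, with each step conditioned on the previously drafted token's embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentDraftHead(nn.Module):
    """One lightweight head whose parameters are reused at every draft step."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # A single linear "RNN cell" shared across all draft positions.
        self.cell = nn.Linear(2 * hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def step(self, state, token_emb):
        # The new state depends on the previous state and the embedding of
        # the token just drafted -- this is the recurrent dependency.
        state = F.silu(self.cell(torch.cat([state, token_emb], dim=-1)))
        return state, self.lm_head(state)

def draft_greedy(head, embed, llm_hidden, first_token, num_steps=4):
    # Greedy drafting for illustration; the paper drafts with beam search.
    # llm_hidden: the target model's last hidden state, (batch, hidden).
    state, token, drafted = llm_hidden, first_token, []
    for _ in range(num_steps):
        state, logits = head.step(state, embed(token))
        token = logits.argmax(dim=-1)
        drafted.append(token)
    return drafted
```

Because the same `cell` and `lm_head` are reused at every step, the drafting cost in parameters is constant no matter how many tokens are drafted, unlike approaches that attach one head per future position.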

Model Definition

ReDrafter adopts a single-model strategy, feeding the embeddings of previously drafted tokens back into the draft head as recurrent inputs. This approach not only simplifies the model but also improves its predictions by conditioning each drafted token on the ones before it. In a notable departure from the Medusa framework, ReDrafter avoids creating a data-dependent tree attention structure ahead of time, opting instead for the simplicity of beam search to eliminate suboptimal candidate sequences early in the inference process.
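
Concretely, one plausible form of this recurrence (a sketch consistent with the description above; the paper's exact parameterization may differ) initializes the draft state from the target model's last hidden state and folds in each drafted token's embedding:

```latex
% s_0: draft state initialized from the target LLM's last hidden state h
% e_i: embedding of the token drafted at step i (the recurrent input)
% p_i: draft distribution over the vocabulary at step i
s_0 = h, \qquad
s_i = f\left(U\, s_{i-1} + W\, e_i + b\right), \qquad
p_i = \operatorname{softmax}(V\, s_i)
```

Because U, W, V, and b are shared across steps, the draft head's parameter count stays fixed regardless of how many tokens it drafts.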

Beam Search and Dynamic Tree Attention

Beam search plays a pivotal role in generating candidate tokens for verification. ReDrafter's recurrent design makes it possible to identify promising candidate sequences directly and efficiently, reducing the verification workload on the target model. Furthermore, the model introduces a dynamic tree attention mechanism, an algorithmic enhancement that uses the beam search results to deduplicate shared prefixes among candidates, saving computation and memory at runtime; this is a notable advance over the static tree structures proposed in earlier models.
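
Building on the hypothetical `RecurrentDraftHead` above, the sketch below shows both ideas: standard beam search over the draft head, followed by a prefix count illustrating why deduplicating shared prefixes (the essence of dynamic tree attention) saves work. Function names and tensor shapes are illustrative assumptions, not the paper's code.

```python
import torch

def beam_search_draft(head, embed, llm_hidden, first_logits,
                      beam_width=4, steps=4):
    # Keep the beam_width highest-scoring continuations at every draft step.
    # llm_hidden: (hidden,); first_logits: (vocab,) from the target model.
    log_probs, tokens = first_logits.log_softmax(-1).topk(beam_width)
    states = llm_hidden.unsqueeze(0).expand(beam_width, -1)
    beams = tokens.unsqueeze(-1)            # (beam_width, 1)
    scores = log_probs                      # cumulative log-probabilities
    for _ in range(steps - 1):
        states, logits = head.step(states, embed(beams[:, -1]))
        step_lp, step_tok = logits.log_softmax(-1).topk(beam_width)
        # Expand every beam by beam_width, then keep the global top beams.
        flat = (scores.unsqueeze(-1) + step_lp).flatten()
        scores, idx = flat.topk(beam_width)
        parent, pick = idx // beam_width, idx % beam_width
        states = states[parent]
        beams = torch.cat([beams[parent],
                           step_tok[parent, pick].unsqueeze(-1)], dim=-1)
    return beams, scores  # candidates for the target model to verify

def count_tree_nodes(beams):
    # Beams often share prefixes, so verification only needs one
    # computation per distinct prefix node, not one per (beam, position).
    nodes = {tuple(b[: i + 1].tolist())
             for b in beams for i in range(b.numel())}
    return len(nodes)
```

The set built in `count_tree_nodes` corresponds to the token tree the target model attends over during verification; since it is derived from the actual beams at runtime, the structure is dynamic rather than fixed in advance.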

Experiments

The evaluation of ReDrafter focuses on its training efficiency and inference performance, comparing the proposed method against existing speculative decoding approaches on popular open-source LLMs. The paper details an extensive comparison demonstrating ReDrafter's superior speed and reduced parameter count without sacrificing prediction quality.

Training and Inference Performance

Assessments indicate that ReDrafter not only outperforms its speculative decoding counterparts in predictive accuracy but also achieves this with a substantially lower parameter count. Specifically, the model attains higher speed-up factors (up to 3.28 times) compared to existing methods, illustrating its efficacy in reducing computational overhead during LLM inference.

Discussion and Future Directions

The recurrent drafter represents a significant step forward in speculative decoding, marrying the simplicity of single-model designs with the efficiency of recurrent dependencies. Its ability to construct attention structures dynamically from beam search outcomes further distinguishes it from prior models, offering a more flexible and effective approach to speculative decoding.

While ReDrafter demonstrates considerable promise, the paper also acknowledges potential areas for future development, such as exploring more complex model structures and joint training mechanisms to further enhance performance.

Conclusion

The introduction of ReDrafter marks a noteworthy advancement in improving the efficiency of LLMs through speculative decoding. By combining a recurrent draft head inspired by RNN language models with beam search and dynamic tree attention, this approach sets a new standard for speculative decoding, offering a pathway toward more efficient and effective use of LLMs in real-world applications.
