
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

(2404.18911)
Published Apr 29, 2024 in cs.CL and cs.LG

Abstract

Speculative decoding has demonstrated its effectiveness in accelerating the inference of LLMs while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework Kangaroo, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the sub-network and the full model's representation ability. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on the Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to 1.68× on Spec-Bench, outperforming Medusa-1 with 88.7% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.

Figure: Comparison of different speculative decoding methods using Spec-Bench data.

Overview

  • The Kangaroo framework introduces a novel method for accelerating Large Language Model (LLM) inference by using a self-draft model paired with a lightweight adapter module, reducing inference latency without a large increase in parameters.

  • An early-exiting strategy during the drafting process halts draft generation once the predicted token's confidence falls below a set threshold, avoiding wasted computation and preserving efficiency.

  • Comparative analyses show that Kangaroo outperforms other speculative decoding methods like Medusa and Lookahead in terms of speed and parameter efficiency, offering substantive theoretical and practical advancements in LLM operations.

Enhancing Speculative Decoding with the Kangaroo Framework for LLM Inference Acceleration

Introduction

The paper under review introduces the Kangaroo framework, a novel approach to accelerating Large Language Model (LLM) inference via speculative decoding. The method uses a shallow sub-network of the full model as a self-draft model and bridges it to the full model's representation ability with a lightweight, efficient adapter module. Kangaroo primarily targets reducing inference latency while maintaining an acceptable token acceptance rate, achieving speedups of up to 1.68× on the Spec-Bench benchmark.

Technical Innovation

The core innovation in Kangaroo lies in using a small, fixed sub-network of the full LLM as a self-draft model, enhanced by a lightweight adapter module; a code sketch of this construction follows the list below. This setup provides several advantages:

  1. Reduced overhead by avoiding the training of a separate draft model.
  2. Minimized increase in parameter count due to the adapter module's design.
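
As a concrete illustration of the first early exit, the sketch below shows how the self-draft model can be assembled from the target LLM's own shallow layers, the trained adapter, and the shared LM head. The attribute names (embed, layers, lm_head), the adapter object, and the exit depth are illustrative assumptions, not the exact interface of the released implementation.

```python
def self_draft_logits(input_ids, model, adapter, exit_layer=2):
    """Sketch of Kangaroo's self-draft model: run only the first `exit_layer`
    transformer blocks of the target LLM, refine the hidden states with the
    lightweight adapter, and reuse the target model's own LM head, so no
    separate draft network needs to be trained or stored."""
    hidden = model.embed(input_ids)            # shared token embeddings
    for block in model.layers[:exit_layer]:    # fixed shallow sub-network
        hidden = block(hidden)
    hidden = adapter(hidden)                   # bridge the representation gap
    return model.lm_head(hidden)               # shared output head
```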

In addition, an early-exiting strategy is employed during the drafting phase: drafting halts as soon as the predicted token's confidence falls below a threshold, avoiding wasted computation on difficult tokens whose drafts would likely be rejected during verification.
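
A minimal sketch of this second early exit, assuming greedy drafting and a hand-picked threshold, is shown below; the step limit, threshold value, and the draft_logits callable (for instance, a function like the self_draft_logits sketch above) are illustrative placeholders rather than the paper's exact settings.

```python
import torch

def draft_with_confidence_exit(input_ids, draft_logits, max_steps=6, threshold=0.6):
    """Illustrative drafting loop: propose tokens with the self-draft model and
    stop as soon as the top-1 confidence drops below `threshold`, handing the
    drafted tokens to the full model for parallel verification."""
    proposed = []
    for _ in range(max_steps):
        logits = draft_logits(input_ids)                 # shallow layers + adapter + LM head
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # next-token distribution
        confidence, token = probs.max(dim=-1)
        if confidence.item() < threshold:                # second early exit
            break
        proposed.append(token.item())
        input_ids = torch.cat([input_ids, token.view(1, 1)], dim=-1)
    return proposed
```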

The adapter module itself consists of a multi-head attention mechanism and two normalization layers; notably, it requires only about 11.3% of the additional parameters used by comparable components in methods such as Medusa (67M versus 591M).
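
To make this description concrete, the following is a rough PyTorch sketch of such an adapter, assuming standard nn.MultiheadAttention and nn.LayerNorm building blocks; the hidden size, head count, residual wiring, and normalization placement are guesses for illustration and may differ from the released configuration.

```python
import torch
import torch.nn as nn

class KangarooStyleAdapter(nn.Module):
    """Rough sketch of an adapter built from one multi-head attention block and
    two normalization layers (no feed-forward sub-layer), used to refine the
    shallow sub-network's hidden states before the shared LM head."""

    def __init__(self, hidden_size: int = 4096, num_heads: int = 32):
        super().__init__()
        self.norm_in = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm_out = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        x = self.norm_in(hidden_states)
        attn_out, _ = self.attn(x, x, x, need_weights=False)  # self-attention only
        return self.norm_out(hidden_states + attn_out)        # residual + final norm
```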

Performance Analysis

Kangaroo was extensively evaluated on Spec-Bench, a benchmark for comparing speculative decoding implementations. The model demonstrated speedups of up to 1.68×, outperforming competitive approaches such as Medusa-1 while using 88.7% fewer additional parameters (67M versus 591M). These results underscore Kangaroo's efficiency in handling the self-drafting process without a large increase in parameters.

Comparative Assessment

When compared with other methods such as Lookahead, Medusa, and REST, Kangaroo consistently offered superior performance in both speed and efficiency. The adapter's design plays a crucial role here, bridging the representation gap between the shallow sub-network and the full LLM with minimal additional parameters.

Theoretical and Practical Implications

From a theoretical standpoint, this paper provides significant insights into effective parameter sharing within LLMs to reduce inference costs. Practically, Kangaroo offers a feasible pathway to integrating speculative decoding within existing LLM architectures without requiring extensive computational resources or retraining.

Future Prospects

The presented findings lay a solid foundation for future explorations into efficient decoding methodologies. Potential research could explore the scalability of the Kangaroo framework across even larger models or its application and adaptability in real-time language processing tasks. Additionally, further optimization of the adapter module could result in even more significant performance gains.

Conclusion

In summary, the Kangaroo framework marks a substantial step forward in speculative decoding by efficiently pairing a self-draft model with a lightweight adapter module, significantly reducing inference latency. The use of a fixed shallow sub-network, together with an early-exit mechanism during the drafting phase, preserves the larger LLM's output distribution while enabling rapid and cost-efficient language processing. This method opens up promising avenues for further improving the efficiency and applicability of LLMs across various computational environments.
