Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (2401.10774v3)
Abstract: Large language models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step dependent on the previous step's output. This creates a bottleneck, as each step requires moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been proposed to address this issue, their implementation is impeded by the challenges of acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases. Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of the Medusa heads and higher speedup, but requiring a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation procedure to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
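To make the core idea concrete, below is a minimal PyTorch sketch of Medusa-style extra decoding heads. It is an illustrative assumption rather than the authors' implementation: the class names (`ResidualBlock`, `MedusaHeads`), the single residual feed-forward block per head, and the hyperparameters (hidden size 4096, vocabulary size 32000, four heads) are chosen for the example. Each head maps the backbone's final hidden state to logits for a token at a further offset beyond the backbone's own next-token prediction; in the Medusa-1 setting the backbone stays frozen and only these heads are trained, and the candidates they propose are then verified with tree-based attention (not shown here).

```python
# Illustrative sketch (not the paper's reference code): extra decoding heads
# attached to a backbone LLM's final hidden state. Names and sizes are
# hypothetical choices for this example.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One feed-forward layer with a SiLU activation and a residual connection."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))


class MedusaHeads(nn.Module):
    """K extra heads; each head speculates a token at a further offset beyond
    the backbone's own next-token prediction."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                nn.Sequential(
                    ResidualBlock(hidden_size),
                    nn.Linear(hidden_size, vocab_size),
                )
                for _ in range(num_heads)
            ]
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size), taken from the backbone's last position.
        # Returns (num_heads, batch, vocab_size) logits, one set per speculated offset.
        return torch.stack([head(last_hidden) for head in self.heads])


# Usage sketch in the Medusa-1 setting: the backbone is frozen, only the heads train.
heads = MedusaHeads(hidden_size=4096, vocab_size=32000, num_heads=4)
dummy_hidden = torch.randn(2, 4096)        # stand-in for the backbone's output
speculative_logits = heads(dummy_hidden)   # proposals for the next 4 token offsets
print(speculative_logits.shape)            # torch.Size([4, 2, 32000])
```

In a full decoding step, the top candidates from each head would be expanded into a small tree of continuations and checked in a single verification pass by the backbone, so accepted prefixes advance the sequence by several tokens per step.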