
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

(2407.14057)
Published Jul 19, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

The inference of transformer-based LLMs consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34x while maintaining accuracy.

Figure: The LazyLLM framework progressively prunes tokens, reducing computation while retaining performance.

Overview

  • LazyLLM introduces a dynamic token pruning technique that optimizes the prefilling and decoding stages of LLM inference by pruning less important tokens based on attention scores.

  • The method utilizes a progressive, layer-wise token pruning strategy and an additional Aux Cache mechanism to ensure that pruned tokens can be efficiently reintroduced without redundant calculations, effectively reducing computational load.

  • Experimental results on models like Llama 2 (7B) and XGen (7B) using the LongBench benchmark demonstrate significant reductions in time-to-first-token (TTFT) and overall inference time, while maintaining high accuracy.

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

The paper "LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference" addresses the computational challenges associated with LLMs in scenarios involving long prompts. The authors introduce LazyLLM, a novel technique to dynamically prune tokens that are less important for generating subsequent tokens, thereby optimizing both the prefilling and decoding stages of LLM inference.

Introduction and Problem Statement

The inference process for transformer-based LLMs comprises two primary stages: the prefilling stage and the decoding stage. During the prefilling stage, the entire Key-Value (KV) cache of prompt tokens is computed to generate the first token. The quadratic increase in computation with respect to the number of tokens makes this stage particularly time-consuming for long prompts. In contrast, the decoding stage benefits from already computed KV caches, making subsequent token predictions computationally less intensive. Hence, optimizing the prefilling stage is crucial for reducing the time-to-first-token (TTFT) and overall latency in LLM inference.
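
As a point of reference, the toy single-layer sketch below (not the paper's code) shows why the two stages scale so differently: prefilling computes attention over all prompt tokens at once, while each decoding step attends a single new query against the cached keys and values. Sizes are illustrative and causal masking is omitted.

```python
# Toy single-layer sketch of the two inference stages, illustrating why
# prefilling dominates time-to-first-token for long prompts: its attention
# cost grows quadratically with prompt length, while each decoding step is
# roughly linear in the current sequence length.
import torch

d_model, n_prompt = 64, 4096                      # toy sizes; real models are larger
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))

def attention(q, k, v):
    scores = q @ k.T / d_model ** 0.5             # causal masking omitted for brevity
    return torch.softmax(scores, dim=-1) @ v

# Prefilling stage: compute K/V for *all* prompt tokens, then the first token.
prompt_hidden = torch.randn(n_prompt, d_model)
k_cache, v_cache = prompt_hidden @ Wk, prompt_hidden @ Wv
_ = attention(prompt_hidden @ Wq, k_cache, v_cache)   # O(n_prompt^2) -> dominates TTFT

# Decoding stage: one token per step, reusing (and appending to) the KV cache.
for _ in range(8):
    new_hidden = torch.randn(1, d_model)              # stand-in for the newest token
    k_cache = torch.cat([k_cache, new_hidden @ Wk])   # append, don't recompute
    v_cache = torch.cat([v_cache, new_hidden @ Wv])
    _ = attention(new_hidden @ Wq, k_cache, v_cache)  # O(seq_len) per step
```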

Methodology: LazyLLM

LazyLLM diverges from static pruning methods by dynamically selecting different subsets of tokens at various steps of the sequence generation, including those that might have been pruned in previous steps. This dynamic token pruning is based on the attention scores from the prior transformer layer, thus retaining tokens that are more relevant for the next token prediction.
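
The exact scoring details are in the paper; the hedged sketch below assumes the importance of a prompt token is the attention probability that the final position assigns to it, averaged over heads, with a top-k selection per generation step. The shapes, the `keep_ratio` value, and the function names are illustrative, not the authors' implementation.

```python
# Hedged sketch of attention-based token importance and per-step selection.
import torch

def token_importance(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: (n_heads, seq_len, seq_len) post-softmax attention."""
    # Attention from the last query position to every key position,
    # averaged across heads -> one importance score per token.
    return attn_probs[:, -1, :].mean(dim=0)

def select_tokens(attn_probs: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Indices of the tokens to keep at this step, in their original order."""
    scores = token_importance(attn_probs)
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices.sort().values

# Example: 8 heads over a 16-token context, keeping the top half of the tokens.
probs = torch.softmax(torch.randn(8, 16, 16), dim=-1)
kept = select_tokens(probs, keep_ratio=0.5)
```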

Progressive Token Pruning:

  • The authors utilize a layer-wise progressive pruning strategy: tokens are pruned at multiple transformer layers, with deeper layers pruning more aggressively, so the computational load is reduced gradually rather than all at once.
  • Tokens are pruned based on their attention scores, with lower-scoring tokens deemed less relevant to the next token prediction.
  • Importantly, previously pruned tokens can be revived in later steps if they become relevant again, ensuring minimal loss in predictive performance (a minimal sketch of this pruning loop follows the list).
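
The sketch below illustrates the progressive, layer-wise pruning loop during prefilling. The dummy layer, the `keep_schedule` values, and the choice to apply each ratio to the already-pruned length are simplifying assumptions, not the paper's implementation.

```python
# Illustrative prefilling loop with progressive, layer-wise pruning: every
# layer scores the surviving tokens and keeps only a fraction, with deeper
# layers keeping fewer tokens.
import torch

def dummy_layer(hidden):
    """Stand-in transformer layer: returns new hidden states and single-head
    attention probabilities (causal masking and MLP omitted)."""
    probs = torch.softmax(hidden @ hidden.T / hidden.size(-1) ** 0.5, dim=-1)
    return probs @ hidden, probs

def lazy_prefill(hidden, keep_schedule):
    token_ids = torch.arange(hidden.size(0))          # which prompt positions survive
    for keep_ratio in keep_schedule:                  # one entry per layer
        hidden, probs = dummy_layer(hidden)
        scores = probs[-1, :]                         # attention paid by the last position
        k = max(1, int(keep_ratio * hidden.size(0)))
        keep = torch.topk(scores, k).indices.sort().values
        hidden, token_ids = hidden[keep], token_ids[keep]   # prune before the next layer
    return hidden, token_ids

hidden = torch.randn(32, 16)                          # 32 prompt tokens, toy width 16
out, survivors = lazy_prefill(hidden, keep_schedule=[1.0, 0.9, 0.7, 0.5])
```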

Aux Cache Mechanism:

  • To handle scenarios where pruned tokens become relevant again in later steps, the authors introduce an additional caching mechanism called the Aux Cache.
  • This cache stores the hidden states of pruned tokens, allowing them to be reintroduced into the computation without repeating the work of earlier layers.
  • Because each token's hidden state is computed at most once per layer, the worst-case runtime of LazyLLM never exceeds that of the baseline, making the method both effective and efficient (a sketch of such a cache appears below).
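
The following is a hedged sketch of how such an auxiliary cache might be organized: each layer stashes the hidden states of the tokens it prunes, so a revived token can resume from that layer instead of being recomputed from the embeddings upward. The `AuxCache` class and its methods are illustrative names, not the paper's API.

```python
# Hedged sketch of an auxiliary cache for pruned tokens: hidden states are
# stashed per (layer, position) at pruning time and retrieved if the token
# is revived later, avoiding recomputation of earlier layers.
import torch

class AuxCache:
    def __init__(self):
        self.store = {}                    # (layer_idx, token_pos) -> hidden state

    def stash(self, layer_idx, positions, hidden):
        for pos, h in zip(positions.tolist(), hidden):
            self.store[(layer_idx, pos)] = h

    def revive(self, layer_idx, position):
        # Return the cached hidden state if this token was pruned before this
        # layer; the caller recomputes only its K/V from here onward.
        return self.store.get((layer_idx, position))

aux = AuxCache()
pruned_pos = torch.tensor([3, 7, 11])
aux.stash(layer_idx=5, positions=pruned_pos, hidden=torch.randn(3, 16))
h = aux.revive(layer_idx=5, position=7)   # reintroduced without rerunning layers 0-4
```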

Experimental Results

The authors evaluated LazyLLM on two models: Llama 2 (7B) and XGen (7B), using the LongBench benchmark, which includes tasks like single- and multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.

Key Findings:

  • LazyLLM accelerated TTFT by 2.34x on the multi-document QA task with the Llama 2 7B model while maintaining accuracy.
  • The method consistently offered better TTFT speedup compared to baselines like random token drop, static token pruning, and prompt compression, without necessitating any fine-tuning.
  • The technique reduced computation in both the prefilling and decoding stages, yielding a significant reduction in overall generation time.

Implications and Future Directions

Practical Implications:

  • LazyLLM provides a training-free approach to enhance inference efficiency in long-context applications of LLMs, making it readily deployable in real-world scenarios.
  • The method's universality allows it to be integrated with existing transformer architectures without necessitating modifications or retraining, thus lowering the barrier for implementation.

Theoretical Implications:

  • This work encourages a re-evaluation of the necessity of all tokens in the inference process and proposes a dynamic approach to token relevance assessment.
  • Future research could delve deeper into optimizing the progressive pruning layers and refining the criteria for token importance, further improving efficiency without compromising accuracy.

Speculations for Future Developments:

  • Improvements in attention mechanisms and token pruning strategies could further enhance the efficiency of generative models.
  • The advancement of more sophisticated caching mechanisms could mitigate the computational burdens associated with dynamic token revivals.
  • Exploring LazyLLM's applicability to different architectures and more comprehensive benchmarks could solidify its utility across various NLP tasks.

In summary, LazyLLM offers a significant contribution to the domain of efficient LLM inference, particularly under long-context scenarios, and sets the stage for future innovations in the dynamic optimization of generative models.
