
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

(2407.14057)
Published Jul 19, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

The inference of transformer-based LLMs consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens. For long prompts, the KV cache must be computed for all tokens during the prefilling stage, which can significantly increase the time needed to generate the first token. Consequently, the prefilling stage may become a bottleneck in the generation process. An open question remains whether all prompt tokens are essential for generating the first token. To answer this, we introduce a novel method, LazyLLM, that selectively computes the KV for tokens important for the next token prediction in both the prefilling and decoding stages. Contrary to static pruning approaches that prune the prompt at once, LazyLLM allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Extensive experiments on standard datasets across various tasks demonstrate that LazyLLM is a generic method that can be seamlessly integrated with existing language models to significantly accelerate the generation without fine-tuning. For instance, in the multi-document question-answering task, LazyLLM accelerates the prefilling stage of the Llama 2 7B model by 2.34x while maintaining accuracy.

Figure: The LazyLLM framework progressively prunes tokens, reducing computation while retaining performance.

Overview

  • LazyLLM introduces a dynamic token pruning technique that optimizes the prefilling and decoding stages of LLM inference by pruning less important tokens based on attention scores.

  • The method utilizes a progressive, layer-wise token pruning strategy and an additional Aux Cache mechanism to ensure that pruned tokens can be efficiently reintroduced without redundant calculations, effectively reducing computational load.

  • Experimental results on models like Llama 2 (7B) and XGen (7B) using the LongBench benchmark demonstrate significant reductions in time-to-first-token (TTFT) and overall inference time, while maintaining high accuracy.

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

The paper "LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference" addresses the computational challenges associated with LLMs in scenarios involving long prompts. The authors introduce LazyLLM, a novel technique to dynamically prune tokens that are less important for generating subsequent tokens, thereby optimizing both the prefilling and decoding stages of LLM inference.

Introduction and Problem Statement

The inference process for transformer-based LLMs comprises two primary stages: the prefilling stage and the decoding stage. During the prefilling stage, the entire Key-Value (KV) cache of prompt tokens is computed to generate the first token. The quadratic increase in computation with respect to the number of tokens makes this stage particularly time-consuming for long prompts. In contrast, the decoding stage benefits from already computed KV caches, making subsequent token predictions computationally less intensive. Hence, optimizing the prefilling stage is crucial for reducing the time-to-first-token (TTFT) and overall latency in LLM inference.
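
As a point of reference, the toy single-layer sketch below (not the paper's code) shows why the two stages scale so differently: prefilling computes attention over all prompt tokens at once, while each decoding step attends a single new query against the cached keys and values. Sizes are illustrative and causal masking is omitted.

```python
# Toy single-layer sketch of the two inference stages, illustrating why
# prefilling dominates time-to-first-token for long prompts: its attention
# cost grows quadratically with prompt length, while each decoding step is
# roughly linear in the current sequence length.
import torch

d_model, n_prompt = 64, 4096                      # toy sizes; real models are larger
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))

def attention(q, k, v):
    scores = q @ k.T / d_model ** 0.5             # causal masking omitted for brevity
    return torch.softmax(scores, dim=-1) @ v

# Prefilling stage: compute K/V for *all* prompt tokens, then the first token.
prompt_hidden = torch.randn(n_prompt, d_model)
k_cache, v_cache = prompt_hidden @ Wk, prompt_hidden @ Wv
_ = attention(prompt_hidden @ Wq, k_cache, v_cache)   # O(n_prompt^2) -> dominates TTFT

# Decoding stage: one token per step, reusing (and appending to) the KV cache.
for _ in range(8):
    new_hidden = torch.randn(1, d_model)              # stand-in for the newest token
    k_cache = torch.cat([k_cache, new_hidden @ Wk])   # append, don't recompute
    v_cache = torch.cat([v_cache, new_hidden @ Wv])
    _ = attention(new_hidden @ Wq, k_cache, v_cache)  # O(seq_len) per step
```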

Methodology: LazyLLM

LazyLLM diverges from static pruning methods by dynamically selecting different subsets of tokens at various steps of the sequence generation, including those that might have been pruned in previous steps. This dynamic token pruning is based on the attention scores from the prior transformer layer, thus retaining tokens that are more relevant for the next token prediction.
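
The exact scoring details are in the paper; the hedged sketch below assumes the importance of a prompt token is the attention probability that the final position assigns to it, averaged over heads, with a top-k selection per generation step. The shapes, the `keep_ratio` value, and the function names are illustrative, not the authors' implementation.

```python
# Hedged sketch of attention-based token importance and per-step selection.
import torch

def token_importance(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: (n_heads, seq_len, seq_len) post-softmax attention."""
    # Attention from the last query position to every key position,
    # averaged across heads -> one importance score per token.
    return attn_probs[:, -1, :].mean(dim=0)

def select_tokens(attn_probs: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Indices of the tokens to keep at this step, in their original order."""
    scores = token_importance(attn_probs)
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices.sort().values

# Example: 8 heads over a 16-token context, keeping the top half of the tokens.
probs = torch.softmax(torch.randn(8, 16, 16), dim=-1)
kept = select_tokens(probs, keep_ratio=0.5)
```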

Progressive Token Pruning:

  • The authors utilize a layer-wise progressive pruning strategy: tokens are pruned at multiple transformer layers, with deeper layers pruning more aggressively, so the computational load is reduced gradually rather than all at once.
  • Tokens are pruned based on their attention scores, with lower-scoring tokens deemed less relevant to the next token prediction.
  • Importantly, previously pruned tokens can be revived in later steps if they become relevant again, ensuring minimal loss in predictive performance (a minimal sketch of this pruning loop follows the list).
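
The sketch below illustrates the progressive, layer-wise pruning loop during prefilling. The dummy layer, the `keep_schedule` values, and the choice to apply each ratio to the already-pruned length are simplifying assumptions, not the paper's implementation.

```python
# Illustrative prefilling loop with progressive, layer-wise pruning: every
# layer scores the surviving tokens and keeps only a fraction, with deeper
# layers keeping fewer tokens.
import torch

def dummy_layer(hidden):
    """Stand-in transformer layer: returns new hidden states and single-head
    attention probabilities (causal masking and MLP omitted)."""
    probs = torch.softmax(hidden @ hidden.T / hidden.size(-1) ** 0.5, dim=-1)
    return probs @ hidden, probs

def lazy_prefill(hidden, keep_schedule):
    token_ids = torch.arange(hidden.size(0))          # which prompt positions survive
    for keep_ratio in keep_schedule:                  # one entry per layer
        hidden, probs = dummy_layer(hidden)
        scores = probs[-1, :]                         # attention paid by the last position
        k = max(1, int(keep_ratio * hidden.size(0)))
        keep = torch.topk(scores, k).indices.sort().values
        hidden, token_ids = hidden[keep], token_ids[keep]   # prune before the next layer
    return hidden, token_ids

hidden = torch.randn(32, 16)                          # 32 prompt tokens, toy width 16
out, survivors = lazy_prefill(hidden, keep_schedule=[1.0, 0.9, 0.7, 0.5])
```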

Aux Cache Mechanism:

  • To handle scenarios where pruned tokens become relevant again in later steps, the authors introduce an additional caching mechanism called the Aux Cache.
  • This cache stores the hidden states of pruned tokens, allowing them to be reintroduced into the computation without repeating the work of earlier layers.
  • Because each token's hidden state is computed at most once per layer, the worst-case runtime of LazyLLM never exceeds that of the baseline, making the method both effective and efficient (a sketch of such a cache appears below).
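
The following is a hedged sketch of how such an auxiliary cache might be organized: each layer stashes the hidden states of the tokens it prunes, so a revived token can resume from that layer instead of being recomputed from the embeddings upward. The `AuxCache` class and its methods are illustrative names, not the paper's API.

```python
# Hedged sketch of an auxiliary cache for pruned tokens: hidden states are
# stashed per (layer, position) at pruning time and retrieved if the token
# is revived later, avoiding recomputation of earlier layers.
import torch

class AuxCache:
    def __init__(self):
        self.store = {}                    # (layer_idx, token_pos) -> hidden state

    def stash(self, layer_idx, positions, hidden):
        for pos, h in zip(positions.tolist(), hidden):
            self.store[(layer_idx, pos)] = h

    def revive(self, layer_idx, position):
        # Return the cached hidden state if this token was pruned before this
        # layer; the caller recomputes only its K/V from here onward.
        return self.store.get((layer_idx, position))

aux = AuxCache()
pruned_pos = torch.tensor([3, 7, 11])
aux.stash(layer_idx=5, positions=pruned_pos, hidden=torch.randn(3, 16))
h = aux.revive(layer_idx=5, position=7)   # reintroduced without rerunning layers 0-4
```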

Experimental Results

The authors evaluated LazyLLM on two models: Llama 2 (7B) and XGen (7B), using the LongBench benchmark, which includes tasks like single- and multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.

Key Findings:

  • LazyLLM accelerated TTFT by 2.34x on the multi-document QA task with the Llama 2 7B model while maintaining accuracy.
  • The method consistently offered better TTFT speedup compared to baselines like random token drop, static token pruning, and prompt compression, without necessitating any fine-tuning.
  • The technique reduced computation in both the prefilling and decoding stages, yielding a significant reduction in overall generation time.

Implications and Future Directions

Practical Implications:

  • LazyLLM provides a training-free approach to enhance inference efficiency in long-context applications of LLMs, making it readily deployable in real-world scenarios.
  • The method's universality allows it to be integrated with existing transformer architectures without necessitating modifications or retraining, thus lowering the barrier for implementation.

Theoretical Implications:

  • This work encourages a re-evaluation of the necessity of all tokens in the inference process and proposes a dynamic approach to token relevance assessment.
  • Future research could delve deeper into optimizing the progressive pruning layers and refining the criteria for token importance, further improving efficiency without compromising accuracy.

Speculations for Future Developments:

  • Improvements in attention mechanisms and token pruning strategies could further enhance the efficiency of generative models.
  • The advancement of more sophisticated caching mechanisms could mitigate the computational burdens associated with dynamic token revivals.
  • Exploring LazyLLM's applicability to different architectures and more comprehensive benchmarks could solidify its utility across various NLP tasks.

In summary, LazyLLM offers a significant contribution to the domain of efficient LLM inference, particularly under long-context scenarios, and sets the stage for future innovations in the dynamic optimization of generative models.
