Abstract

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices, namely the A-shape, Vertical-Slash, and Block-Sparse patterns, which can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.

MInference speeds up long-context LLM inference by up to 10x with dynamic sparse attention, compared to full-attention baselines.

Overview

  • MInference 1.0 addresses the computational bottleneck in LLMs with extended context windows by optimizing the pre-filling stage using dynamic sparse attention mechanisms.

  • The paper identifies three characteristic attention patterns (A-shape, Vertical-Slash, Block-Sparse) and employs kernel-aware searches to determine the optimal configuration for efficiency without sacrificing model accuracy.

  • Experimental results show that MInference achieves up to 10x speedup in pre-filling long-context LLMs while maintaining accuracy, demonstrating its applicability in various tasks and datasets.

Overview of MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

The paper titled "MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention" addresses a critical bottleneck in the deployment of LLMs with extended context windows. The authors focus on optimizing the pre-filling stage of LLMs that process sequences of up to 1 million tokens, particularly mitigating the computational challenges posed by the quadratic complexity of the attention mechanism.

Key Contributions and Methodology

Identification of Attention Patterns

The paper identifies three characteristic patterns in the attention matrices of long-context LLMs—namely, A-shape, Vertical-Slash (VS), and Block-Sparse patterns. These patterns reveal spatial aggregations of sparse attention weights, which the authors exploit to perform efficient sparse computations on GPUs; a toy code sketch of the three mask shapes follows the list below.

  1. A-shape Pattern: Concentrates on initial tokens and local windows.
  2. Vertical-Slash Pattern: Combines vertical attention lines and fixed-interval slash lines.
  3. Block-Sparse Pattern: Focuses on clusters of top attention weights grouped in blocks.
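
To make the three shapes concrete, here is a minimal PyTorch sketch that builds boolean attention masks for each pattern on a toy sequence. The parameter names and sizes (`init_width`, `local_window`, `block`, the example indices) are illustrative assumptions, not the configurations used in the paper.

```python
# Toy illustration of the A-shape, Vertical-Slash, and Block-Sparse mask shapes.
import torch

n = 64  # toy sequence length

def a_shape_mask(n, init_width=4, local_window=8):
    """Attend to the first `init_width` tokens plus a local causal window."""
    q = torch.arange(n).unsqueeze(1)   # query positions, shape (n, 1)
    k = torch.arange(n).unsqueeze(0)   # key positions, shape (1, n)
    causal = k <= q
    initial = k < init_width
    local = (q - k) < local_window
    return causal & (initial | local)

def vertical_slash_mask(n, vertical_idx, slash_offsets):
    """Attend to a few key columns (verticals) plus a few diagonals (slashes)."""
    q = torch.arange(n).unsqueeze(1)
    k = torch.arange(n).unsqueeze(0)
    causal = k <= q
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, vertical_idx] = True               # vertical lines
    for off in slash_offsets:                  # slash line at distance off = q - k
        mask |= (q - k) == off
    return causal & mask

def block_sparse_mask(n, block, selected_blocks):
    """Attend only inside a selected set of (query-block, key-block) pairs."""
    q = torch.arange(n).unsqueeze(1)
    k = torch.arange(n).unsqueeze(0)
    causal = k <= q
    mask = torch.zeros(n, n, dtype=torch.bool)
    for qb, kb in selected_blocks:
        mask[qb * block:(qb + 1) * block, kb * block:(kb + 1) * block] = True
    return causal & mask

# Example: fraction of attention entries each toy mask keeps.
print(a_shape_mask(n).float().mean().item())
print(vertical_slash_mask(n, vertical_idx=[0, 1, 13], slash_offsets=[0, 1, 17]).float().mean().item())
print(block_sparse_mask(n, block=16, selected_blocks=[(0, 0), (1, 1), (2, 2), (3, 3), (3, 0)]).float().mean().item())
```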

The authors develop a kernel-aware search method to determine the optimal attention pattern for each head, balancing computational efficiency with retention of model accuracy. This search is performed offline to establish the most effective pattern configurations.
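One plausible way such an offline assignment could work is sketched below: for each head, candidate masks (each built under a matched sparse-compute budget) are scored by how much attention probability mass they retain on a reference prompt, and the best-scoring pattern is kept. The recall-style criterion, the function names, and the candidate set here are assumptions for illustration; the paper's actual kernel-aware search may differ in its objective and budgeting.

```python
# Hedged sketch of an offline pattern search for a single attention head.
import torch

def attention_recall(attn, mask):
    """Fraction of total attention probability mass retained by a sparse mask."""
    return ((attn * mask.float()).sum() / attn.sum()).item()

def search_head_pattern(q, k, candidate_masks):
    """Pick the candidate mask that keeps the most attention mass on a reference prompt."""
    n, d = q.shape
    scores = (q @ k.T) * d ** -0.5
    causal = torch.ones(n, n, dtype=torch.bool).tril()
    attn = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1)
    recalls = {name: attention_recall(attn, m) for name, m in candidate_masks.items()}
    return max(recalls, key=recalls.get)

# Example usage (reusing the toy mask builders sketched above):
# q, k = torch.randn(64, 32), torch.randn(64, 32)
# best = search_head_pattern(q, k, {
#     "a_shape": a_shape_mask(64),
#     "vertical_slash": vertical_slash_mask(64, [0, 1, 13], [0, 1, 17]),
#     "block_sparse": block_sparse_mask(64, 16, [(i, i) for i in range(4)]),
# })
```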

Dynamic Sparse Attention Calculations

During inference, MInference dynamically builds sparse indices for attention heads based on the identified patterns. This adaptation considers the specific input to generate the most efficient sparse mask. For example, a partial computation using the last few query vectors aids in estimating the critical indices of vertical and slash lines for the VS pattern. Similarly, for block-sparse heads, mean pooling on query and key vectors approximates the most significant blocks to include in the sparse mask.
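The sketch below mirrors this description for a single head: vertical and slash lines are scored from the attention of only the last few queries, and block-sparse candidates are scored from mean-pooled query and key blocks. The specific top-k sizes, `last_q`, and pooling details are illustrative assumptions; the selected slash offsets and block indices plug into mask builders like those sketched earlier.

```python
# Hedged sketch of the online sparse-index estimation for one attention head.
import torch

def estimate_vertical_slash(q, k, last_q=64, top_v=128, top_s=64):
    """Score vertical (key-column) and slash (diagonal) lines using only the
    last `last_q` query vectors, then keep the top-k of each."""
    n, d = q.shape
    last_q = min(last_q, n)
    q_pos = torch.arange(n - last_q, n).unsqueeze(1)          # (last_q, 1)
    k_pos = torch.arange(n).unsqueeze(0)                      # (1, n)
    scores = (q[-last_q:] @ k.T) * d ** -0.5                  # (last_q, n)
    probs = scores.masked_fill(k_pos > q_pos, float("-inf")).softmax(dim=-1)
    vertical_score = probs.sum(dim=0)                         # importance of each key column
    offsets = (q_pos - k_pos).clamp(min=0)                    # diagonal distance q - k (non-causal entries have prob 0)
    slash_score = torch.zeros(n).scatter_add_(0, offsets.flatten(), probs.flatten())
    vertical_idx = vertical_score.topk(min(top_v, n)).indices
    slash_offsets = slash_score.topk(min(top_s, n)).indices
    return vertical_idx, slash_offsets

def estimate_block_sparse(q, k, block=64, top_blocks=16):
    """Mean-pool Q and K per block and keep the top-k key blocks per query block."""
    n, d = q.shape
    nb = n // block
    q_pool = q[: nb * block].reshape(nb, block, d).mean(dim=1)
    k_pool = k[: nb * block].reshape(nb, block, d).mean(dim=1)
    block_scores = (q_pool @ k_pool.T) * d ** -0.5            # (nb, nb)
    causal = torch.ones(nb, nb, dtype=torch.bool).tril()
    block_scores = block_scores.masked_fill(~causal, float("-inf"))
    return block_scores.topk(min(top_blocks, nb), dim=-1).indices  # selected key blocks per query block
```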

The subsequent computation employs optimized GPU kernels, built on dynamic sparse compilers such as PIT and Triton together with FlashAttention, to accelerate the attention mechanism and significantly reduce latency during the pre-filling stage. A plain-PyTorch reference for the block-sparse case is sketched below.
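The kernel internals are beyond this summary, but the following reference routine (a hedged sketch, not the authors' implementation) shows what the block-sparse computation does with the per-query-block indices produced in the estimation step; the function name, the `block_idx` layout, and the assumption that the length is a multiple of the block size are illustrative. The Vertical-Slash case is analogous, gathering the selected columns and diagonals instead of key blocks, and the real kernels fuse this logic into Triton / FlashAttention-style code.

```python
# Unoptimized reference for block-sparse attention over selected key blocks.
import torch

def block_sparse_attention_reference(q, k, v, block_idx, block=64):
    """q, k, v: (n, d) for one head; block_idx: (n_blocks, top_k) key-block ids
    per query block, e.g. from estimate_block_sparse above. Assumes n is a
    multiple of `block`."""
    n, d = q.shape
    nb = n // block
    out = torch.zeros_like(q)
    for qb in range(nb):
        sel = sorted({int(kb) for kb in block_idx[qb] if int(kb) <= qb})  # causal key blocks only
        k_cat = torch.cat([k[kb * block:(kb + 1) * block] for kb in sel])
        v_cat = torch.cat([v[kb * block:(kb + 1) * block] for kb in sel])
        k_pos = torch.cat([torch.arange(kb * block, (kb + 1) * block) for kb in sel])
        q_pos = torch.arange(qb * block, (qb + 1) * block)
        scores = (q[qb * block:(qb + 1) * block] @ k_cat.T) * d ** -0.5
        scores = scores.masked_fill(k_pos.unsqueeze(0) > q_pos.unsqueeze(1), float("-inf"))
        out[qb * block:(qb + 1) * block] = scores.softmax(dim=-1) @ v_cat
    return out
```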

Experimental Validation

The authors conduct extensive experiments on several state-of-the-art LLMs (LLaMA-3-8B, GLM-4-9B, and Yi-9B, among others) across diverse benchmarks, including InfiniteBench, RULER, and Needle In A Haystack, as well as language modeling tasks with PG-19. Key findings include:

  • Accuracy Maintenance: MInference maintains or even slightly enhances the long-context capabilities of the LLMs compared to full attention baselines.
  • Significant Speedups: It achieves up to 10x speedup for 1M token contexts on an Nvidia A100 GPU, reducing pre-filling latency from 30 minutes to 3 minutes while sustaining model accuracy.
  • Generalization: The method exhibits robust performance across various tasks and datasets, demonstrating its applicability.

Implications and Future Directions

The practical implications of this research are substantial. By markedly accelerating the pre-filling stage without compromising accuracy, MInference facilitates the deployment of long-context LLMs in real-world applications that require processing large contexts, such as legal document analysis, repository-scale code understanding, and question answering over very long documents.

This method also reduces the computational cost associated with LLMs, making them more accessible and feasible for a broader range of users and applications. Furthermore, the compatibility of MInference with existing LLM architectures without necessitating additional training adjustments highlights its practical utility.

Future developments in this domain could explore further optimizing the balance between computational overhead and inference efficiency. Additionally, integrating MInference with other inference optimization techniques, such as KV cache compression methods like SnapKV, could yield further improvements in both latency and efficiency.

Moreover, dynamic sparse attention techniques could be extended to other forms of neural networks beyond autoregressive models, such as encoder-decoder models or multi-modal LLMs, potentially revealing broader applications and efficiency improvements.

In conclusion, MInference represents a significant stride towards efficient long-context processing in LLMs, providing a scalable approach to handling the ever-expanding demands of modern AI applications. This work lays the groundwork for ongoing innovations in sparse computation and efficient inference, promising enhanced performance and reduced costs for future AI systems.
