MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490v2)
Abstract: The computational challenges of LLM inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling for long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. Evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and on models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100 while maintaining accuracy. Our code is available at https://aka.ms/MInference.
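As a rough illustration of how the dynamic index building described above can work for the Vertical-Slash pattern, the sketch below uses only the last few queries of a head to approximate the attention distribution, then keeps the highest-mass key columns (vertical lines) and diagonal offsets (slash lines). This is a minimal pure-PyTorch sketch under assumed budgets; the function name `vertical_slash_indices`, the defaults `last_q`, `v_topk`, and `s_topk`, and the index-building details are illustrative assumptions, not the released MInference kernels.

```python
# Minimal sketch of the Vertical-Slash idea: estimate which vertical columns and
# slash (diagonal) lines matter using only the last few queries, then keep just
# those positions. Hyperparameters and the helper name are illustrative.
import torch

def vertical_slash_indices(q, k, last_q=64, v_topk=1000, s_topk=64):
    """q, k: [seq_len, head_dim] for one head. Returns index sets, not a kernel."""
    seq_len, d = q.shape
    # Approximate attention with only the last `last_q` queries: O(last_q * n) cost.
    scores = (q[-last_q:] @ k.T) / d**0.5                          # [last_q, seq_len]
    # Causal mask within the estimation window.
    mask = torch.arange(seq_len) <= torch.arange(seq_len - last_q, seq_len)[:, None]
    probs = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

    # Vertical lines: key positions (columns) with the largest summed attention.
    vertical = probs.sum(dim=0).topk(min(v_topk, seq_len)).indices

    # Slash lines: diagonal offsets (query_pos - key_pos) with the largest mass.
    offsets = torch.arange(seq_len - last_q, seq_len)[:, None] - torch.arange(seq_len)[None, :]
    slash_mass = torch.zeros(seq_len, dtype=probs.dtype).scatter_add_(
        0, offsets.clamp(min=0).reshape(-1), probs.reshape(-1))
    slash = slash_mass.topk(min(s_topk, seq_len)).indices          # kept diagonal offsets

    return vertical, slash  # a sparse kernel would consume these as its index set
```

The point of such a pattern is that the full n-by-n score matrix is never materialized: the estimate costs O(last_q · n) rather than O(n²), and only the selected columns and diagonals are computed exactly by the sparse attention kernel.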
- Keyformer: KV cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems, 6:114–127, 2024.
- Phi-3 technical report: A highly capable language model locally on your phone. ArXiv preprint, abs/2404.14219, 2024.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Unlimiformer: Long-range transformers with unlimited length input. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Qwen technical report. ArXiv preprint, abs/2309.16609, 2023.
- Longformer: The long-document transformer. ArXiv preprint, abs/2004.05150, 2020.
- CodePlan: Repository-level coding using LLMs and planning. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
- Generating long sequences with sparse transformers. ArXiv preprint, abs/1904.10509, 2019.
- Peek across: Improving multi-document modeling via cross-document question-answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1970–1989, 2023.
- Extending context window of large language models via positional interpolation. ArXiv preprint, abs/2306.15595, 2023.
- InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024.
- Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning, 2024.
- Sequence can secretly tell you what to discard. ArXiv preprint, abs/2404.15949, 2024.
- LongNet: Scaling transformers to 1,000,000,000 tokens. ArXiv preprint, abs/2307.02486, 2023.
- Attention is naturally sparse with gaussian distributed input. ArXiv preprint, abs/2404.02690, 2024.
- Get more with LESS: Synthesizing recurrence with KV cache compression for efficient LLM inference. In Forty-first International Conference on Machine Learning, 2024.
- LongRoPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning, 2024.
- Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, 2024.
- Yao Fu. Challenges in deploying long-context transformers: A theoretical peak performance analysis. ArXiv preprint, abs/2405.08944, 2024.
- Mamba: Linear-time sequence modeling with selective state spaces. ArXiv preprint, abs/2312.00752, 2023.
- Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- Gradient. Llama-3 8B Instruct Gradient 4194k (v0.1), 2024.
- Model tells you what to discard: Adaptive KV cache compression for LLMs. In The Twelfth International Conference on Learning Representations, 2024.
- ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. ArXiv preprint, abs/2406.12793, 2024.
- Block transformer: Global-to-local language modeling for fast inference. ArXiv preprint, abs/2406.02657, 2024.
- RULER: What’s the real context size of your long-context language models? ArXiv preprint, abs/2404.06654, 2024.
- LM-infinite: Zero-shot extreme length generalization for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991–4008, Mexico City, Mexico, 2024. Association for Computational Linguistics.
- Mistral 7B. ArXiv preprint, abs/2310.06825, 2023.
- DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. ArXiv preprint, abs/2309.14509, 2023.
- LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, 2023.
- LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024.
- SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024.
- Greg Kamradt. Needle In A Haystack - pressure testing LLMs, 2023.
- Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
- Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 2020.
- Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10619–10629, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
- On the expressive power of self-attention matrices. ArXiv preprint, abs/2106.03764, 2021.
- On the expressive flexibility of self-attention matrices. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8773–8781, 2023.
- Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024.
- SnapKV: LLM knows what you are looking for before generation. ArXiv preprint, abs/2404.14469, 2024.
- Jamba: A hybrid transformer-mamba language model. ArXiv preprint, abs/2403.19887, 2024.
- Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- nnScaler: Constraint-guided parallelization plan generation for deep learning training. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, 2024.
- Dynamic sparse attention for scalable transformer acceleration. IEEE Transactions on Computers, 71(12):3165–3178, 2022.
- Deja vu: Contextual sparsity for efficient LLMs at inference time. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2023.
- World model on million-length video and language with RingAttention. ArXiv preprint, abs/2402.08268, 2024.
- Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
- Long-context LLMs struggle with long in-context learning. ArXiv preprint, abs/2404.02060, 2024.
- IceFormer: Accelerated inference with long-sequence transformers on CPUs. In The Twelfth International Conference on Learning Representations, 2024.
- Leave no context behind: Efficient infinite context transformers with infini-attention. ArXiv preprint, abs/2404.07143, 2024.
- Dynamic memory compression: Retrofitting LLMs for accelerated inference. In Forty-first International Conference on Machine Learning, 2024.
- XGen-7B technical report. ArXiv preprint, abs/2309.03450, 2023.
- Transformers are multi-state RNNs. ArXiv preprint, abs/2401.06104, 2024.
- RWKV: Reinventing RNNs for the transformer era. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore, 2023. Association for Computational Linguistics.
- Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.
- Fast attention over long sequences with dynamic sparse flash attention. Advances in Neural Information Processing Systems, 36, 2024.
- YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024.
- Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024.
- SparQ attention: Bandwidth-efficient LLM inference. In Forty-first International Conference on Machine Learning, 2024.
- Samba: Simple hybrid state space models for efficient unlimited context language modeling. ArXiv preprint, abs/2406.07522, 2024.
- Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, abs/2403.05530, 2024.
- Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
- TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. ArXiv preprint, abs/2404.11912, 2024.
- Retentive network: A successor to transformer for large language models. ArXiv preprint, abs/2307.08621, 2023.
- You only cache once: Decoder-decoder architectures for language models. ArXiv preprint, abs/2405.05254, 2024.
- SparseBERT: Rethinking the importance analysis in self-attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 9547–9557. PMLR, 2021.
- Noam Shazeer. Fast transformer decoding: One write-head is all you need. ArXiv preprint, abs/1911.02150, 2019.
- UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2023.
- Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
- Triton implementation of the flash attention v2 algorithm. Technical report, OpenAI, 2023.
- Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- QUEST: Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024.
- Lilian Weng. LLM-powered autonomous agents. lilianweng.github.io, 2023.
- LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference. ArXiv preprint, abs/2406.18139, 2024.
- SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021.
- Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.
- InfLLM: Unveiling the intrinsic capacity of LLMs for understanding extremely long sequences with training-free memory. ArXiv preprint, abs/2402.04617, 2024.
- Yi: Open foundation models by 01.AI. ArXiv preprint, abs/2403.04652, 2024.
- A unified implicit attention formulation for gated-linear recurrent sequence models. ArXiv preprint, abs/2405.16504, 2024.
- ∞Bench: Extending long context evaluation beyond 100K tokens. ArXiv preprint, abs/2402.13718, 2024.
- Big bird: Transformers for longer sequences. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- PIT: Optimization of dynamic sparse deep learning models via permutation invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023.
- H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024.