MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490v2)
Abstract: The computational challenges of LLM inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling for long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. Evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and on models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100 while maintaining accuracy. Our code is available at https://aka.ms/MInference.
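As a rough illustration of how the dynamic index building described above can work for the Vertical-Slash pattern, the sketch below uses only the last few queries of a head to approximate the attention distribution, then keeps the highest-mass key columns (vertical lines) and diagonal offsets (slash lines). This is a minimal pure-PyTorch sketch under assumed budgets; the function name `vertical_slash_indices`, the defaults `last_q`, `v_topk`, and `s_topk`, and the index-building details are illustrative assumptions, not the released MInference kernels.

```python
# Minimal sketch of the Vertical-Slash idea: estimate which vertical columns and
# slash (diagonal) lines matter using only the last few queries, then keep just
# those positions. Hyperparameters and the helper name are illustrative.
import torch

def vertical_slash_indices(q, k, last_q=64, v_topk=1000, s_topk=64):
    """q, k: [seq_len, head_dim] for one head. Returns index sets, not a kernel."""
    seq_len, d = q.shape
    # Approximate attention with only the last `last_q` queries: O(last_q * n) cost.
    scores = (q[-last_q:] @ k.T) / d**0.5                          # [last_q, seq_len]
    # Causal mask within the estimation window.
    mask = torch.arange(seq_len) <= torch.arange(seq_len - last_q, seq_len)[:, None]
    probs = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

    # Vertical lines: key positions (columns) with the largest summed attention.
    vertical = probs.sum(dim=0).topk(min(v_topk, seq_len)).indices

    # Slash lines: diagonal offsets (query_pos - key_pos) with the largest mass.
    offsets = torch.arange(seq_len - last_q, seq_len)[:, None] - torch.arange(seq_len)[None, :]
    slash_mass = torch.zeros(seq_len, dtype=probs.dtype).scatter_add_(
        0, offsets.clamp(min=0).reshape(-1), probs.reshape(-1))
    slash = slash_mass.topk(min(s_topk, seq_len)).indices          # kept diagonal offsets

    return vertical, slash  # a sparse kernel would consume these as its index set
```

The point of such a pattern is that the full n-by-n score matrix is never materialized: the estimate costs O(last_q · n) rather than O(n²), and only the selected columns and diagonals are computed exactly by the sparse attention kernel.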
- Keyformer: KV cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems, 6:114–127, 2024.
- Phi-3 technical report: A highly capable language model locally on your phone. ArXiv preprint, abs/2404.14219, 2024.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Unlimiformer: Long-range transformers with unlimited length input. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Qwen technical report. ArXiv preprint, abs/2309.16609, 2023.
- Longformer: The long-document transformer. ArXiv preprint, abs/2004.05150, 2020.
- CodePlan: Repository-level coding using LLMs and planning. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
- Generating long sequences with sparse transformers. ArXiv preprint, abs/1904.10509, 2019.
- Peek across: Improving multi-document modeling via cross-document question-answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1970–1989, 2023.
- Extending context window of large language models via positional interpolation. ArXiv preprint, abs/2306.15595, 2023.
- InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024.
- Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning, 2024.
- Sequence can secretly tell you what to discard. ArXiv preprint, abs/2404.15949, 2024.
- LongNet: Scaling transformers to 1,000,000,000 tokens. ArXiv preprint, abs/2307.02486, 2023.
- Attention is naturally sparse with gaussian distributed input. ArXiv preprint, abs/2404.02690, 2024.
- Get more with LESS: Synthesizing recurrence with KV cache compression for efficient LLM inference. In Forty-first International Conference on Machine Learning, 2024.
- LongRoPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning, 2024.
- Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, 2024.
- Yao Fu. Challenges in deploying long-context transformers: A theoretical peak performance analysis. ArXiv preprint, abs/2405.08944, 2024.
- Mamba: Linear-time sequence modeling with selective state spaces. ArXiv preprint, abs/2312.00752, 2023.
- Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- Gradient. Llama-3 8B Instruct Gradient 4194k (v0.1), 2024.
- Model tells you what to discard: Adaptive KV cache compression for LLMs. In The Twelfth International Conference on Learning Representations, 2024.
- ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. ArXiv preprint, abs/2406.12793, 2024.
- Block transformer: Global-to-local language modeling for fast inference. ArXiv preprint, abs/2406.02657, 2024.
- RULER: What’s the real context size of your long-context language models? ArXiv preprint, abs/2404.06654, 2024.
- LM-infinite: Zero-shot extreme length generalization for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991–4008, Mexico City, Mexico, 2024. Association for Computational Linguistics.
- Mistral 7B. ArXiv preprint, abs/2310.06825, 2023.
- DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. ArXiv preprint, abs/2309.14509, 2023.
- LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, 2023.
- LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024.
- SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024.
- Greg Kamradt. Needle In A Haystack - pressure testing LLMs, 2023.
- Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
- Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 2020.
- Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10619–10629, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
- On the expressive power of self-attention matrices. ArXiv preprint, abs/2106.03764, 2021.
- On the expressive flexibility of self-attention matrices. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8773–8781, 2023.
- Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024.
- SnapKV: LLM knows what you are looking for before generation. ArXiv preprint, abs/2404.14469, 2024.
- Jamba: A hybrid transformer-mamba language model. ArXiv preprint, abs/2403.19887, 2024.
- Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- nnScaler: Constraint-guided parallelization plan generation for deep learning training. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, 2024.
- Dynamic sparse attention for scalable transformer acceleration. IEEE Transactions on Computers, 71(12):3165–3178, 2022.
- Deja vu: Contextual sparsity for efficient LLMs at inference time. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2023.
- World model on million-length video and language with RingAttention. ArXiv preprint, abs/2402.08268, 2024.
- Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
- Long-context LLMs struggle with long in-context learning. ArXiv preprint, abs/2404.02060, 2024.
- IceFormer: Accelerated inference with long-sequence transformers on CPUs. In The Twelfth International Conference on Learning Representations, 2024.
- Leave no context behind: Efficient infinite context transformers with infini-attention. ArXiv preprint, abs/2404.07143, 2024.
- Dynamic memory compression: Retrofitting LLMs for accelerated inference. In Forty-first International Conference on Machine Learning, 2024.
- XGen-7B technical report. ArXiv preprint, abs/2309.03450, 2023.
- Transformers are multi-state RNNs. ArXiv preprint, abs/2401.06104, 2024.
- RWKV: Reinventing RNNs for the transformer era. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore, 2023. Association for Computational Linguistics.
- Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.
- Fast attention over long sequences with dynamic sparse flash attention. Advances in Neural Information Processing Systems, 36, 2024.
- YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024.
- Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
- LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024.
- SparQ attention: Bandwidth-efficient LLM inference. In Forty-first International Conference on Machine Learning, 2024.
- Samba: Simple hybrid state space models for efficient unlimited context language modeling. ArXiv preprint, abs/2406.07522, 2024.
- Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, abs/2403.05530, 2024.
- Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
- TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. ArXiv preprint, abs/2404.11912, 2024.
- Retentive network: A successor to transformer for large language models. ArXiv preprint, abs/2307.08621, 2023.
- You only cache once: Decoder-decoder architectures for language models. ArXiv preprint, abs/2405.05254, 2024.
- SparseBERT: Rethinking the importance analysis in self-attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 9547–9557. PMLR, 2021.
- Noam Shazeer. Fast transformer decoding: One write-head is all you need. ArXiv preprint, abs/1911.02150, 2019.
- UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2023.
- Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
- Triton implementation of the flash attention v2 algorithm. Technical report, OpenAI, 2023.
- Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- QUEST: Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024.
- Lilian Weng. LLM-powered autonomous agents. lilianweng.github.io, 2023.
- LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference. ArXiv preprint, abs/2406.18139, 2024.
- SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021.
- Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.
- InfLLM: Unveiling the intrinsic capacity of LLMs for understanding extremely long sequences with training-free memory. ArXiv preprint, abs/2402.04617, 2024.
- Yi: Open foundation models by 01.AI. ArXiv preprint, abs/2403.04652, 2024.
- A unified implicit attention formulation for gated-linear recurrent sequence models. ArXiv preprint, abs/2405.16504, 2024.
- ∞Bench: Extending long context evaluation beyond 100K tokens. ArXiv preprint, abs/2402.13718, 2024.
- Big bird: Transformers for longer sequences. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- PIT: Optimization of dynamic sparse deep learning models via permutation invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023.
- H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024.