ThinK: Thinner Key Cache by Query-Driven Pruning (2407.21018v3)
Abstract: LLMs have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. However, their increased computational and memory demands present significant challenges, especially when handling long sequences. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. Unlike existing approaches that optimize memory along the sequence-length dimension, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by an uneven magnitude distribution and a low-rank structure in the attention weights. In response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also reduces KV cache memory costs by over 20% compared with vanilla KV cache eviction and quantization methods. For instance, ThinK integrated with KIVI can achieve a 2.8x reduction in peak memory usage while maintaining nearly the same quality, enabling up to a 5x increase in batch size on a single GPU. Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verify the efficiency of ThinK, establishing a new baseline algorithm for efficient LLM deployment without compromising performance. Our code has been made available at https://github.com/SalesforceAIResearch/ThinK.
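To make the channel-pruning idea in the abstract concrete, the sketch below scores each key-cache channel by the magnitude of its contribution to the query-key product over a recent window of queries and greedily keeps the top-scoring channels. This is a minimal PyTorch illustration of the query-driven criterion described above, not the released implementation: the function name prune_key_channels, the 32-query observation window, and the 0.6 keep ratio are assumptions chosen for the example (see the official repository linked above for the actual method).

```python
# Illustrative sketch of query-driven key-channel pruning.
# Shapes, window size, and keep ratio are assumptions for this example.
import torch


def prune_key_channels(queries: torch.Tensor,
                       keys: torch.Tensor,
                       keep_ratio: float = 0.6,
                       window: int = 32):
    """Keep only the highest-scoring channels of the key cache.

    queries: [batch, heads, q_len, head_dim]  recent queries (observation window)
    keys:    [batch, heads, k_len, head_dim]  cached keys to prune
    Returns the pruned keys and the indices of the kept channels per head.
    """
    head_dim = keys.shape[-1]
    n_keep = max(1, int(keep_ratio * head_dim))

    # Score channel i by || Q[..., i] K[..., i]^T ||_F over the query window.
    # That matrix is rank-1, so its Frobenius norm factorizes into a product
    # of column norms, making the score cheap to compute.
    q_win = queries[:, :, -window:, :]
    scores = q_win.norm(dim=-2) * keys.norm(dim=-2)      # [batch, heads, head_dim]

    # Greedily keep the top-scoring channels and gather them from the cache.
    kept = scores.topk(n_keep, dim=-1).indices           # [batch, heads, n_keep]
    gather_idx = kept.unsqueeze(-2).expand(-1, -1, keys.shape[-2], -1)
    pruned_keys = keys.gather(dim=-1, index=gather_idx)  # [batch, heads, k_len, n_keep]
    return pruned_keys, kept


if __name__ == "__main__":
    b, h, q_len, k_len, d = 1, 8, 64, 1024, 128
    Q, K = torch.randn(b, h, q_len, d), torch.randn(b, h, k_len, d)
    K_pruned, kept = prune_key_channels(Q, K)
    print(K_pruned.shape)  # torch.Size([1, 8, 1024, 76]) with keep_ratio=0.6
```

In a full decoding pipeline the query would need to be indexed with the same kept channels before the dot product (or the pruned keys zero-filled back to full width); the sketch omits that step and only shows how the channel-importance scores are formed from the queries and keys.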
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
- LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
- Eigen analysis of self-attention and its reconstruction from partial computation. arXiv preprint arXiv:2106.08823.
- What is the state of neural network pruning? Proceedings of Machine Learning and Systems, 2:129–146.
- Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
- Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems, 34:17413–17426.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
- Demmel, J. W. (1997). Applied numerical linear algebra. SIAM.
- QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36.
- Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
- RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863.
- Get more with less: Synthesizing recurrence with KV cache compression for efficient LLM inference. arXiv preprint arXiv:2402.09398.
- The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
- Model tells you what to discard: Adaptive KV cache compression for LLMs. arXiv preprint arXiv:2310.01801.
- Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. TechRxiv.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- Kamradt, G. (2023). Needle In A Haystack - pressure testing LLMs. GitHub.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626.
- Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR.
- SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
- AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100.
- MiniCache: KV cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366.
- KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750.
- LLM-Pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720.
- Meta AI (2024). Introducing Meta Llama 3: The most capable openly available LLM to date.
- OpenAI (2022). OpenAI: Introducing ChatGPT.
- OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542.
- Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13693–13696.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Efficient large language models: A survey. arXiv preprint arXiv:2312.03863.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- Layer-condensed KV cache for efficient inference of large language models. arXiv preprint arXiv:2405.10637.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR.
- Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
- Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.
- QA-LoRA: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717.
- PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference. arXiv preprint arXiv:2405.12532.
- Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57.
- HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. arXiv preprint arXiv:1905.06566.
- PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069.
- H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36.