Layer-Condensed KV Cache for Efficient Inference of Large Language Models (2405.10637v2)
Abstract: Huge memory consumption has been a major bottleneck for deploying high-throughput LLMs in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep LLMs. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on LLMs show that our method achieves up to 26× higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model, achieving further improvement in inference efficiency. Our code is available at https://github.com/whyNLP/LCKV.
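To make the memory argument in the abstract concrete, here is a minimal sketch of a KV cache that only stores entries for a small subset of layers, which is the source of the savings the paper describes. This is not the authors' released implementation: the class name, the `retained_layers` parameter, and the tensor shapes in the demo are illustrative assumptions.

```python
import torch

class LayerCondensedKVCache:
    """Sketch of a KV cache that retains keys/values for only a few layers.

    Hypothetical illustration of the memory bookkeeping implied by the
    abstract; the real method also rewires which KVs each layer attends to,
    which is omitted here.
    """

    def __init__(self, num_layers, retained_layers):
        assert all(0 <= i < num_layers for i in retained_layers)
        self.retained = set(retained_layers)
        # Layers outside `retained_layers` never get a cache entry.
        self.cache = {i: None for i in self.retained}

    def update(self, layer_idx, k, v):
        # k, v: (batch, heads, new_tokens, head_dim)
        if layer_idx not in self.retained:
            return None  # this layer's KVs are computed but not cached
        if self.cache[layer_idx] is None:
            self.cache[layer_idx] = (k, v)
        else:
            old_k, old_v = self.cache[layer_idx]
            self.cache[layer_idx] = (
                torch.cat([old_k, k], dim=2),
                torch.cat([old_v, v], dim=2),
            )
        return self.cache[layer_idx]

    def memory_bytes(self):
        total = 0
        for entry in self.cache.values():
            if entry is not None:
                total += sum(t.numel() * t.element_size() for t in entry)
        return total


if __name__ == "__main__":
    # Toy comparison: retaining 2 of 22 layers vs. all 22
    # (batch=1, 8 heads, 1024 cached tokens, head_dim=64, fp16).
    def cache_size(retained):
        cache = LayerCondensedKVCache(num_layers=22, retained_layers=retained)
        for i in range(22):
            k = torch.zeros(1, 8, 1024, 64, dtype=torch.float16)
            v = torch.zeros(1, 8, 1024, 64, dtype=torch.float16)
            cache.update(i, k, v)
        return cache.memory_bytes()

    # Roughly 11x less KV-cache memory when only 2 of 22 layers are retained.
    print(cache_size(range(22)) / cache_size([0, 21]))
```

Under this accounting, KV-cache memory scales with the number of retained layers rather than the total depth of the model, which is why the saving grows with deeper LLMs.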