LoMA: Lossless Compressed Memory Attention (2401.09486v2)
Abstract: LLMs face limitations due to the high demand on GPU memory and computational resources when handling long contexts. Sparsifying the Key-Value (KV) cache of a transformer model is a typical strategy to alleviate resource usage, but it unavoidably results in the loss of information. We introduce Lossless Compressed Memory Attention (LoMA), a novel approach that enables lossless compression of the KV cache, thereby reducing the memory and computational demands of autoregressive generation. LoMA incorporates a specialized training or fine-tuning procedure alongside an autoregressive generation algorithm optimized for the compressed context. Our method compresses the KV cache after every $tc$ generated tokens, with a compression ratio of $c$ and a target compressed length $t$, and the compression occurs within a single inference pass without dependence on auxiliary models. We engineered an efficient training scheme involving specific inputs, attention masks, and position identifiers to instill this compression capability. Experimental validation demonstrates that LoMA significantly reduces computational consumption and memory usage by achieving lossless KV cache compression.
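
As an illustration of the cache bookkeeping described in the abstract, the following is a minimal sketch, not taken from the paper: the function name `kv_cache_length`, its parameters, and the example values are assumptions chosen for illustration. It counts the KV-cache slots occupied when every segment of $tc$ newly generated tokens is collapsed to $t$ compressed slots, while the unfinished tail of the current segment remains uncompressed.

```python
def kv_cache_length(prompt_len: int, new_tokens: int, t: int, c: int) -> int:
    """Occupied KV-cache slots after `new_tokens` generated tokens, assuming
    each completed segment of t*c tokens is replaced by t compressed slots
    and the prompt itself is left uncompressed (an assumption for this sketch)."""
    full_segments, remainder = divmod(new_tokens, t * c)
    return prompt_len + full_segments * t + remainder


if __name__ == "__main__":
    # Example (assumed values): t = 16 compressed slots per segment and
    # compression ratio c = 4, so every 64 generated tokens collapse to 16 slots.
    for n in (0, 64, 128, 200):
        print(n, kv_cache_length(prompt_len=512, new_tokens=n, t=16, c=4))
```

Under these assumed settings the cache grows by only $t$ slots per $tc$ generated tokens once a segment completes, which is where the memory saving over uncompressed decoding comes from.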