GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM (2403.05527v4)
Abstract: Key-value (KV) caching has become the de facto standard for accelerating generation in LLM inference. However, the growing cache demand with increasing sequence length has turned LLM inference into a memory-bound problem, significantly constraining system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviations in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries of similar magnitudes to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating the three techniques, GEAR fully exploits their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement while reducing peak memory size by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
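The abstract describes a three-part decomposition of each cached KV matrix: ultra-low-precision quantization of the bulk entries of similar magnitude, a low-rank approximation of the resulting quantization error, and a sparse matrix that corrects outlier entries. The sketch below illustrates that decomposition in PyTorch; the function names, default bit width, rank, and outlier ratio are illustrative assumptions, not the paper's reference implementation (see the linked repository for that).

```python
import torch

def gear_compress(kv, n_bits=4, rank=4, outlier_ratio=0.01):
    """Sketch of a GEAR-style decomposition: sparse outliers + uniform
    quantization of the remaining entries + low-rank residual correction.
    All hyperparameter defaults here are assumptions for illustration."""
    # 1) Pull out the largest-magnitude entries into a sparse outlier matrix,
    #    so the remaining entries share a similar magnitude range.
    k = max(1, int(outlier_ratio * kv.numel()))
    flat = kv.flatten()
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    sparse = sparse.view_as(kv)
    dense = kv - sparse

    # 2) Uniform quantization of the remaining entries to n_bits.
    qmax = 2 ** n_bits - 1
    lo, hi = dense.min(), dense.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((dense - lo) / scale).round().clamp(0, qmax)
    dequant = q * scale + lo

    # 3) Low-rank approximation of the quantization residual via truncated SVD.
    residual = dense - dequant
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    L = U[:, :rank] * S[:rank]   # (m, rank), columns scaled by singular values
    R = Vh[:rank, :]             # (rank, n)

    return q, scale, lo, L, R, sparse

def gear_decompress(q, scale, lo, L, R, sparse):
    # Reconstruct: dequantized bulk entries + low-rank residual + sparse outliers.
    return q * scale + lo + L @ R + sparse
```

In this sketch the stored representation is the n-bit integer tensor plus the small low-rank factors and the sparse outlier entries, which is how the three components can yield a high compression ratio while keeping the reconstruction close to the original cache.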