GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

(2403.05527)
Published Mar 8, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Key-value (KV) caching has become the de facto technique to accelerate generation speed for LLM inference. However, the growing cache demand with increasing sequence length has turned LLM inference into a memory-bound problem, significantly constraining system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviations in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to the majority of entries of similar magnitudes at ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating the three techniques, GEAR is able to fully exploit their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak memory size by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.

Figure: GEAR results on GSM8K-CoT with LLaMA2-7B, showing its effectiveness when applied on top of a weight-quantized model.

Overview

  • GEAR introduces a high-efficiency framework for near-lossless KV cache compression in LLMs, integrating uniform quantization, low-rank matrix approximation, and sparse matrix representation.

  • The framework outperforms existing compression methods by significantly reducing memory use and increasing system throughput without compromising the accuracy of generative tasks.

  • Empirical testing on models such as LLaMA2-7B and LLaMA2-13B across tasks like mathematical reasoning illustrated GEAR's superiority in maintaining accuracy and enhancing performance.

  • GEAR’s approach offers critical insights for future LLM developments, emphasizing its flexibility and broad applicability without needing specific hardware, benefitting systems with varying memory and computational constraints.

GEAR: Achieving High-Ratio Near-Lossless KV Cache Compression in Generative Inference for LLMs

In response to the growing demands on memory resources by LLMs during generative inference, significant efforts have been made to optimize KV (Key-Value) cache mechanisms to improve the efficiency of these models. Hao Kang et al. introduce GEAR, an efficient framework that addresses the challenge of memory-bound bottlenecks in LLM inference by enabling near-lossless high-ratio compression of KV caches. This approach integrates uniform quantization, low-rank matrix approximation, and sparse matrix representation, demonstrating substantial improvements in throughput and peak-memory reduction across various LLMs and tasks.

Overview of Existing Challenges

The paper begins with an exposition on the existing strategies for KV cache compression, namely token dropping and quantization, both aimed at reducing memory consumption and enhancing system throughput. However, these methods, though effective for simpler tasks, fall short in complex generative tasks due to high approximation errors that degrade performance.
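
As a rough illustration of why uniform quantization alone struggles, the NumPy sketch below (a toy example of ours, not code from the paper) quantizes a matrix to 4 bits with a single scale and shows how a handful of outlier entries inflate the quantization step size, and therefore the error on every other entry.

```python
# A minimal NumPy sketch (not the paper's implementation) of per-tensor uniform
# quantization, illustrating how a few outlier entries inflate the step size
# and hence the approximation error for all remaining entries.
import numpy as np

def uniform_quantize(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize x to `bits` bits with one scale/zero-point, then dequantize."""
    levels = 2 ** bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / levels if x_max > x_min else 1.0
    q = np.round((x - x_min) / scale)
    return q * scale + x_min  # dequantized approximation

rng = np.random.default_rng(0)
kv = rng.normal(size=(64, 64)).astype(np.float32)   # well-behaved entries
kv_outliers = kv.copy()
kv_outliers[0, :4] = 25.0                            # inject a few outliers

for name, mat in [("no outliers", kv), ("with outliers", kv_outliers)]:
    err = np.linalg.norm(mat - uniform_quantize(mat, bits=4)) / np.linalg.norm(mat)
    print(f"{name}: relative 4-bit quantization error = {err:.3f}")
```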

GEAR Framework

The authors propose GEAR, a compression framework that uniquely combines three distinct yet complementary techniques:

  • Uniform quantization applied to the majority of entries to reduce precision, thereby compressing the data.
  • Low-rank matrix approximation to efficiently represent the quantization residuals, capturing coherent information shared across tokens.
  • Sparse matrix representation to correct individual errors caused by outliers.

By adeptly integrating these techniques, GEAR successfully minimizes approximation errors and achieves near-lossless compression even at high compression ratios.
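
The NumPy sketch below gives a simplified view of this three-part decomposition: a quantized backbone D, a low-rank approximation L of the quantization residual, and a sparse matrix S holding outlier entries. The outlier fraction, rank, and bit width are illustrative choices of ours; the authors' released code at https://github.com/HaoKang-Timmy/GEAR is the authoritative implementation.

```python
# A simplified sketch of the quantized-backbone + low-rank-residual + sparse-outlier
# decomposition described above. Thresholds, rank, and bit width are illustrative.
import numpy as np

def gear_like_compress(x: np.ndarray, bits: int = 4,
                       outlier_frac: float = 0.01, rank: int = 4) -> np.ndarray:
    # 1. Sparse matrix S: pull out the largest-magnitude entries as outliers.
    k = max(1, int(outlier_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    mask = np.abs(x) >= thresh
    S = np.where(mask, x, 0.0)
    dense = np.where(mask, 0.0, x)

    # 2. Quantized backbone D: uniform quantization of the remaining,
    #    similarly scaled entries (kept dequantized here for simplicity).
    levels = 2 ** bits - 1
    lo, hi = dense.min(), dense.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    D = np.round((dense - lo) / scale) * scale + lo

    # 3. Low-rank matrix L: truncated SVD of the quantization residual.
    U, s, Vt = np.linalg.svd(dense - D, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    return D + L + S  # reconstructed approximation of x

rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)
kv[rng.integers(0, 128, 8), rng.integers(0, 64, 8)] = 20.0  # inject outliers

rel_err = np.linalg.norm(kv - gear_like_compress(kv)) / np.linalg.norm(kv)
print(f"relative reconstruction error: {rel_err:.4f}")
```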

Empirical Validation

GEAR was tested across several benchmarks with models like LLaMA2-7B, LLaMA2-13B, and Mistral-7B on tasks encompassing mathematical reasoning, language understanding, and symbolic reasoning. The approach consistently outperformed baseline methods, maintaining near-baseline accuracy at compression ratios up to 3× while significantly improving system throughput and reducing peak memory usage.

System Performance Implications

The efficacy of GEAR extends beyond accuracy maintenance to tangible system improvements. For systems with sufficient memory resources, GEAR's compression capabilities lower peak memory requirements, allowing for larger batch processing or longer sequence generation. In scenarios with constrained GPU resources necessitating offloading, GEAR enhances throughput significantly, showcasing its versatility across different system configurations.
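
To see why cache compression translates into these system-level gains, the back-of-envelope calculation below (shape and batch parameters assumed by us, not figures from the paper) compares the KV cache footprint at FP16 and at roughly 4-bit precision for a LLaMA2-7B-like configuration. GEAR's low-rank and sparse components add a small overhead on top of the quantized cache that is ignored here.

```python
# Back-of-envelope KV cache size estimate (assumptions ours, not results from the paper).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # Factor of 2 accounts for storing both K and V per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative LLaMA2-7B-like shape: 32 layers, 32 KV heads, head dimension 128.
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bytes_per_elem=2)
int4 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bytes_per_elem=0.5)
print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB, ~4-bit KV cache: {int4 / 2**30:.1f} GiB")
```

Freeing that memory is what allows larger batches or longer sequences on the same hardware, and it shrinks the volume of data that must be moved when the cache is offloaded.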

Critical Insights and Future Prospects

Analysis reveals the nuanced importance of both K and V cache components, with a slightly greater sensitivity observed for K cache errors in tasks requiring sequential token generation. Furthermore, GEAR's utility is evident even when applied to models with pre-existing weight quantization, indicating its broad applicability.

The introduction of GEAR marks a significant stride towards resolving the memory bottleneck issue in LLM generative inference. Its near-lossless compression at substantial ratios, without necessitating specific hardware support, presents an attractive solution for enhancing LLM inference efficiency. As the field continues to evolve, the adaptability of frameworks like GEAR will be crucial for meeting the increasing computational demands of advanced AI applications.
