
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models (2405.06219v3)

Published 10 May 2024 in cs.LG and cs.CL

Abstract: LLMs can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context length increases, becoming the bottleneck for deployment. In this paper, we present a strategy called SKVQ, which stands for sliding-window KV cache quantization, to address the issue of extremely low bitwidth KV cache quantization. To achieve this, SKVQ rearranges the channels of the KV cache in order to improve the similarity of channels in quantization groups, and applies clipped dynamic quantization at the group level. Additionally, SKVQ ensures that the most recent window tokens in the KV cache are preserved with high precision. This helps maintain the accuracy of a small but important portion of the KV cache. SKVQ achieves high compression ratios while maintaining accuracy. Our evaluation on LLMs demonstrates that SKVQ surpasses previous quantization approaches, allowing for quantization of the KV cache to 2-bit keys and 1.5-bit values with minimal loss of accuracy. With SKVQ, it is possible to process context lengths of up to 1M tokens for a 7B model on an 80GB GPU, with up to 7 times faster decoding.

Citations (10)

Summary

  • The paper presents SKVQ, a method that uses sliding-window and channel reordering to enable low-bitwidth quantization of KV caches for LLMs.
  • It employs clipped dynamic quantization with channel reordering to minimize errors and maintain full precision for critical, recent tokens.
  • Experiments on the LongBench benchmark show SKVQ outperforms existing methods such as KIVI, achieving effective 2-bit key and 1.5-bit value compression.

SKVQ: Sliding-window Key and Value Cache Quantization for LLMs

Introduction

The ability of LLMs to process longer token sequences introduces challenges related to the efficient management of Key-Value (KV) caches, which are essential to LLM inference operations. Traditional KV cache handling methods prove inadequate as they consume significant memory, especially with extended context lengths. This paper introduces SKVQ, a method designed to achieve low-bitwidth KV cache quantization to mitigate memory bottlenecks while maintaining model accuracy and efficiency.

Method

Clipped Dynamic Quantization with Channel Reorder

SKVQ leverages a novel approach to mitigate quantization errors by grouping similar channels through channel reordering, thereby enhancing quantization accuracy:

  • Channel Reorder: Channels within the KV cache that exhibit similar statistical properties are grouped together. This grouping enables more accurate quantization by minimizing the range variance within each group. The reordering is performed such that operations remain mathematically equivalent.
  • Clipped Dynamic Quantization: Further reduces quantization error by clipping outlier values within each group. A clipping scale α is used to dynamically adjust the quantization boundaries, reducing quantization artifacts, especially at lower bitwidths (Figure 1; a minimal code sketch follows the figure caption).

    Figure 1: Visualization of the key cache going through channel reorder and group clipping in sequence. Elements in the red/green boxes are placed in the same group and share quantization parameters.
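
To make the grouping concrete, below is a minimal PyTorch sketch of channel reordering followed by clipped, group-wise dynamic quantization. It is an illustrative approximation under stated assumptions, not the paper's implementation: the permutation perm is assumed to be precomputed offline (e.g., from calibration statistics), clipping is applied symmetrically around each group's center, and the function names are hypothetical.

```python
import torch

def reorder_channels(kv: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
    # Permute the channel (last) dimension so that channels with similar
    # ranges become adjacent and land in the same quantization group.
    # `perm` is assumed to be computed offline from calibration data.
    return kv.index_select(dim=-1, index=perm)

def clipped_group_quant(x: torch.Tensor, n_bits: int = 2,
                        group_size: int = 128, alpha: float = 0.9):
    # Asymmetric group-wise quantization with a clipping scale `alpha`
    # that shrinks each group's dynamic range to suppress outliers.
    # Assumes x.numel() is divisible by group_size.
    orig_shape = x.shape
    groups = x.reshape(-1, group_size)

    g_min = groups.amin(dim=-1, keepdim=True)
    g_max = groups.amax(dim=-1, keepdim=True)
    center = (g_max + g_min) / 2
    half = (g_max - g_min) / 2 * alpha          # clipped half-range
    c_min, c_max = center - half, center + half

    q_max = 2 ** n_bits - 1
    scale = (c_max - c_min).clamp(min=1e-8) / q_max
    zero_point = (-c_min / scale).round()

    # Values outside the clipped range saturate at 0 or q_max.
    q = (groups / scale + zero_point).round().clamp(0, q_max)
    dequant = (q - zero_point) * scale
    return dequant.reshape(orig_shape), q, scale, zero_point
```

In use, a key tensor would first be permuted along its channel dimension with reorder_channels and then quantized group by group with clipped_group_quant; because the permutation is fixed, the matching channels on the query side can be permuted consistently, in line with the paper's note that the reordering keeps operations mathematically equivalent.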

Sliding Window Quantization Strategy

To address the cumulative quantization errors typical in long-context tasks, the sliding window strategy preserves a small, recent segment of the KV cache in full precision:

  • Locality Exploitation: By maintaining a sliding window of the most recent tokens in full precision, this strategy capitalizes on the locality of attention within transformer architectures.
  • Important KV Cache Filter: Identifies and retains tokens critical to inference accuracy, including recent tokens and certain quantization-sensitive tokens (Figure 2; a minimal code sketch follows the figure caption).

    Figure 2: Overview of the sliding window quantization strategy. At each time step, the latest w tokens of the KV cache are kept in full precision. For a token that slides out of the window, the filter rules decide whether to retain it in high precision.
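
The sliding-window bookkeeping can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: tokens are handled one at a time, the importance filter is a placeholder, and quantize_fn stands in for any group quantizer (for example, the clipped_group_quant sketch above).

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

import torch

KV = Tuple[torch.Tensor, torch.Tensor]  # (key, value) for a single token

class SlidingWindowKVCache:
    # Minimal sketch: the newest `window` tokens stay in full precision.
    # A token that slides out of the window is either retained in full
    # precision (if a filter rule marks it important) or quantized.
    def __init__(self, window: int,
                 quantize_fn: Callable[[torch.Tensor], torch.Tensor]):
        self.window = window
        self.quantize_fn = quantize_fn
        self.fp_window: Deque[KV] = deque()   # most recent tokens, full precision
        self.retained_fp: List[KV] = []       # important older tokens, full precision
        self.quantized: List[KV] = []         # remaining older tokens, quantized

    def _is_important(self, k: torch.Tensor, v: torch.Tensor) -> bool:
        # Placeholder: SKVQ's filter rules for quantization-sensitive
        # tokens would be implemented here.
        return False

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.fp_window.append((k, v))
        while len(self.fp_window) > self.window:
            old_k, old_v = self.fp_window.popleft()
            if self._is_important(old_k, old_v):
                self.retained_fp.append((old_k, old_v))
            else:
                self.quantized.append((self.quantize_fn(old_k),
                                       self.quantize_fn(old_v)))
```

For example, SlidingWindowKVCache(window=128, quantize_fn=lambda t: clipped_group_quant(t)[0]) stores simulated-quantized (dequantized) tensors; a real implementation would pack the integer codes and keep the scales and zero-points for dequantization at attention time.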

Results

Performance and Evaluation

SKVQ was evaluated across multiple tasks on the LongBench benchmark using models from the LLaMA and Mistral families. It consistently outperformed existing quantization methods such as KIVI and SmoothQuant, particularly in long-context scenarios where prior quantization methods suffer significant accuracy losses (Figure 3).

Figure 3: Comparison of SKVQ with KIVI on the needle-in-a-haystack test. SKVQ achieved higher scores while using lower bitwidth.

  • Low Bitwidth Success: SKVQ quantized KV caches to 2 bits for keys and 1.5 bits for values without appreciable performance degradation across various tasks (Figure 4; a back-of-envelope memory estimate follows the figure caption).

    Figure 4: Ablation study: average score of Mistral-7b-Instruct-v0.2 on LongBench under different window sizes. Quantization setting: 2-bit KV cache with group size 128.
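
To see why these bitwidths matter for deployment, here is a hedged back-of-envelope estimate of KV-cache memory at 1M tokens. The model shape (LLaMA-2-7B-like: 32 layers, 32 KV heads, head dimension 128) and the per-group fp16 scale/zero-point overhead are assumptions for illustration, not figures from the paper.

```python
# Back-of-envelope KV-cache memory estimate (illustrative assumptions only).
layers, kv_heads, head_dim = 32, 32, 128            # assumed 7B-class model shape
tokens = 1_000_000
elems = 2 * layers * kv_heads * head_dim * tokens   # keys + values

fp16_gib = elems * 2 / 2**30                        # 2 bytes per fp16 element

# SKVQ-style setting: 2-bit keys, 1.5-bit values, plus an fp16 scale and
# zero-point per group of 128 elements (~0.25 extra bits per element).
avg_bits = (2.0 + 1.5) / 2 + 2 * 16 / 128
quant_gib = elems * avg_bits / 8 / 2**30

print(f"fp16 KV cache:      ~{fp16_gib:.0f} GiB")   # roughly 488 GiB
print(f"quantized KV cache: ~{quant_gib:.0f} GiB")  # roughly 61 GiB
```

Adding roughly 13 GiB of fp16 weights for a 7B model brings the total near 75 GiB, broadly consistent with the paper's claim of handling contexts up to 1M tokens on an 80GB GPU; exact numbers depend on the architecture, the full-precision window, and runtime overheads.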

Conclusion

SKVQ advances the quantization of KV caches for long-context LLM tasks. By combining a sliding window of full-precision recent tokens with channel reordering and clipped dynamic quantization, SKVQ significantly reduces memory usage and improves processing efficiency without sacrificing accuracy. Future work could further optimize the filter rules and deepen integration with existing inference systems, enabling LLMs to process even longer contexts efficiently (Figure 5).

Figure 5: Comparison of SKVQ with KIVI on the 32k context length needle-in-a-haystack test. The baseline score is 268.5. We vary the group size from 64 to 128 and the quantization bits from (2-bit key, 2-bit value) to (2-bit key, 1.5-bit value).
