
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More (2402.12065v2)

Published 19 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze existing quantization approaches, identifying their limitations in balancing the accuracy and efficiency of quantized LLMs. To advance beyond these limitations, we propose WKVQuant, a PTQ framework especially designed for quantizing weights and the key/value (KV) cache of LLMs. Specifically, we incorporate past-only quantization to improve the computation of attention. Additionally, we introduce a two-dimensional quantization strategy to handle the distribution of the KV cache, along with a cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves almost comparable memory savings to weight-activation quantization, while also approaching the performance of weight-only quantization.


Summary

  • The paper introduces WKVQuant, a PTQ framework that quantizes both weights and the Key/Value cache to maintain accuracy while reducing memory usage.
  • It employs Past Only Quantization, two-dimensional quantization, and cross-block reconstruction regularization to effectively minimize quantization errors.
  • Experiments on LLaMA models reveal minimal increases in perplexity, confirming the method's balance between memory efficiency and performance.

WKVQuant: Quantizing Weight and Key/Value Cache for LLMs Gains More

Introduction

The paper "WKVQuant: Quantizing Weight and Key/Value Cache for LLMs Gains More" addresses the challenge of deploying LLMs by focusing on efficient quantization techniques. LLMs like GPT and LLaMA have demonstrated exceptional performance in natural language processing tasks; however, their large memory and computational requirements make deployment difficult. The paper proposes a new quantization framework, WKVQuant, designed specifically for quantizing both weights and Key/Value (KV) caches in LLMs, overcoming the limitations of existing methods that either focus solely on weights or include activations at the cost of accuracy.

Quantization Challenges in LLMs

LLMs consist of massive model weights and use a KV cache to store intermediate activations for auto-regressive generation. Quantization, which reduces memory usage by converting parameters into low-bit integers, is a common approach to addressing this memory demand. Existing methods fall into weight-only quantization, which preserves accuracy but does not reduce memory sufficiently, and weight-activation quantization, which covers more tensors but sacrifices accuracy. The latter quantizes crucial activations such as the KV cache, yet the additional quantization error significantly degrades performance (Figure 1).

Figure 1: Memory-Performance curve.
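
As background, the low-bit conversion referred to above can be illustrated with a generic per-tensor asymmetric uniform quantizer. This is a minimal sketch of the general idea, not the paper's specific quantization scheme:

```python
import torch

def quantize_dequantize(x: torch.Tensor, n_bits: int = 4):
    """Simulate uniform asymmetric quantization: map a float tensor to
    n-bit integers and back, returning the dequantized tensor plus the
    (scale, zero_point) parameters."""
    qmin, qmax = 0, 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x_min / scale).clamp(qmin, qmax)
    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_dequant = (x_int - zero_point) * scale
    return x_dequant, scale, zero_point

# Example: quantization error introduced on a random weight matrix
w = torch.randn(128, 128)
w_hat, scale, zp = quantize_dequantize(w, n_bits=4)
print("mean abs error:", (w - w_hat).abs().mean().item())
```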

WKVQuant Framework

The WKVQuant framework is a Post-Training Quantization (PTQ) method that targets only the weights and the KV cache. This narrower scope allows the framework to retain higher accuracy than methods that attempt to quantize all activations, including temporary ones. WKVQuant introduces three key techniques:

  1. Past Only Quantization (POQ): POQ quantizes only the past KV cache during decoding, while keeping the current key/value in full precision, thus improving the accuracy of attention computations (Figure 2; a minimal sketch follows the list).

    Figure 2: Comparison between original quantization method and Past Only Quantization (POQ) for KV cache.

  2. Two-dimensional Quantization: This strategy employs static channel smoothing together with dynamic token-wise quantization to handle distributional variance within the KV cache, reducing quantization error effectively (see the sketch after the list).
  3. Cross-block Reconstruction Regularization (CRR): CRR optimizes quantization parameters by evaluating the impact of quantization across consecutive blocks, using Mean Absolute Error (MAE) instead of Mean Squared Error (MSE) to reduce the bias introduced by outliers (Figure 3; see the sketch after the list).

    Figure 3: Comparison between Block-wise Mean Square Error (MSE) loss and CRR.
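
The following is a minimal, illustrative sketch of the POQ idea in PyTorch, assuming a single attention head, batch-first tensors, and a per-tensor scale/zero-point for the cached K/V; it is not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def poq_attention_step(q, k_new, v_new, k_cache_int, v_cache_int, kv_params):
    """One decoding step with Past-Only Quantization (illustrative sketch):
    the cached (past) K/V are stored as low-bit integers and dequantized on
    use, while the current token's K/V stay in full precision."""
    (k_scale, k_zp), (v_scale, v_zp) = kv_params

    # Dequantize the past KV cache (per-tensor scale/zero-point for simplicity)
    k_past = (k_cache_int.float() - k_zp) * k_scale
    v_past = (v_cache_int.float() - v_zp) * v_scale

    # The current token's K/V are kept in full precision and appended as-is
    k = torch.cat([k_past, k_new], dim=1)            # (batch, seq, head_dim)
    v = torch.cat([v_past, v_new], dim=1)

    scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v              # (batch, 1, head_dim)
    # Only after this step would k_new/v_new be quantized and appended to the
    # integer cache, i.e. quantization is applied once they become "past".
    return out
```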
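
Next, a sketch of the two-dimensional quantization idea: a static per-channel smoothing factor flattens channel-wise outliers, then each token is quantized dynamically with its own scale and zero-point. The absmax-based calibration of channel_scale here is an assumption for illustration, not necessarily the paper's exact procedure:

```python
import torch

def smooth_and_quantize_per_token(kv: torch.Tensor, channel_scale: torch.Tensor, n_bits: int = 4):
    """Two-dimensional quantization sketch: static channel smoothing
    followed by dynamic token-wise quantization of the KV cache."""
    qmax = 2 ** n_bits - 1

    # Static channel smoothing (channel_scale is calibrated offline)
    kv_s = kv / channel_scale                        # (tokens, channels)

    # Dynamic token-wise quantization: one (scale, zero_point) per token (row)
    t_min = kv_s.min(dim=-1, keepdim=True).values
    t_max = kv_s.max(dim=-1, keepdim=True).values
    scale = (t_max - t_min).clamp(min=1e-8) / qmax
    zp = torch.round(-t_min / scale)
    kv_int = torch.clamp(torch.round(kv_s / scale) + zp, 0, qmax)

    # Dequantize and undo the smoothing to recover an approximation of kv
    kv_hat = ((kv_int - zp) * scale) * channel_scale
    return kv_hat

# Hypothetical calibration: per-channel absmax over some calibration tokens
kv = torch.randn(16, 64)                             # 16 tokens, 64 channels
channel_scale = kv.abs().amax(dim=0).clamp(min=1e-4)
print((kv - smooth_and_quantize_per_token(kv, channel_scale)).abs().mean())
```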
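
Finally, a rough sketch of the CRR objective, assuming fp_blocks and q_blocks are matching sequences of full-precision and quantized blocks (the toy Linear modules in the usage example are hypothetical stand-ins for transformer blocks); the paper's actual optimization details may differ:

```python
import torch
import torch.nn as nn

def crr_loss(fp_blocks, q_blocks, x):
    """Cross-block reconstruction regularization (sketch): run the same input
    through a span of consecutive full-precision blocks and their quantized
    counterparts, then compare the outputs with MAE (L1), which is less
    sensitive to activation outliers than MSE."""
    with torch.no_grad():                 # full-precision path is the fixed target
        y_fp = x
        for blk in fp_blocks:
            y_fp = blk(y_fp)
    y_q = x
    for blk in q_blocks:
        y_q = blk(y_q)
    return (y_q - y_fp).abs().mean()      # MAE over the reconstruction error

# Hypothetical usage with toy "blocks" standing in for transformer blocks
fp_blocks = [nn.Linear(64, 64) for _ in range(3)]
q_blocks = [nn.Linear(64, 64) for _ in range(3)]
loss = crr_loss(fp_blocks, q_blocks, torch.randn(8, 64))
loss.backward()                           # gradients flow only into the quantized path
```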

Experimental Evaluation

Experiments on the LLaMA and LLaMA-2 models demonstrate that WKVQuant achieves memory savings comparable to full weight-activation quantization while preserving accuracy close to weight-only quantization. Specifically, WKVQuant shows only minimal perplexity increases on datasets such as WikiText2 and PTB compared to existing techniques, establishing it as an efficient alternative for model deployment under memory constraints (Figure 4).

Figure 4: The memory usage proportions of weight, KV cache, and temporary storage on LLaMA-2-13b.
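
For reference, the perplexity metric used in these comparisons is the exponential of the mean next-token cross-entropy. A generic computation, not tied to the paper's evaluation code, might look like this:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity as used in language-model evaluation:
    exp of the mean token-level cross-entropy for next-token prediction."""
    # logits: (num_tokens, vocab_size); targets: (num_tokens,) next-token ids
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return torch.exp(nll).item()
```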

Conclusion

WKVQuant provides a valuable trade-off between accuracy and memory efficiency, achieved by focusing on the quantization of model weights and KV cache without involving temporary activations. This approach not only facilitates the deployment of LLMs on resource-constrained hardware but also ensures minimal performance degradation, making it a promising strategy for efficiently managing the memory footprint of large models. Future work could explore the integration of more sophisticated quantization techniques within this framework to further enhance its applicability across various LLM architectures.
