QAQ: Quality Adaptive Quantization for LLM KV Cache (2403.04643v2)

Published 7 Mar 2024 in cs.CL

Abstract: The emergence of LLMs has ignited a fresh surge of breakthroughs in NLP applications, particularly in domains such as question-answering systems and text generation. As the need for longer context grows, a significant bottleneck in model deployment emerges because the Key-Value (KV) cache expands linearly with the context length. Existing methods rely on various heuristics, such as sorting the KV cache by attention scores for replacement or eviction, to compress the KV cache and improve model throughput. However, these heuristics may wrongly evict essential KV cache entries, which can significantly degrade model performance. In this paper, we propose QAQ, a Quality Adaptive Quantization scheme for the KV cache. We theoretically demonstrate that the key cache and the value cache exhibit distinct sensitivities to quantization, which leads to separate, non-uniform quantization strategies for each. With dedicated outlier handling and an improved attention-aware approach, QAQ compresses the KV cache by up to 10x with negligible impact on model performance. QAQ significantly reduces the practical hurdles of deploying LLMs, opening up new possibilities for longer-context applications. The code is available at github.com/ClubieDong/KVCacheQuantization.
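To make the ideas in the abstract concrete, below is a minimal sketch of per-token KV-cache quantization with separate bit-widths for keys and values and simple magnitude-based outlier handling. It is an illustrative assumption of how such a scheme can be wired up, not the authors' implementation (see github.com/ClubieDong/KVCacheQuantization for that); every function name, the bit-widths, and the outlier fraction here are hypothetical choices.

```python
# Minimal sketch (assumptions, not the QAQ codebase): per-token uniform
# quantization of a KV cache, with a small fraction of largest-magnitude
# entries kept in full precision as "outliers", and different bit-widths
# for keys vs. values to reflect their different quantization sensitivities.

import torch


def quantize_per_token(x: torch.Tensor, n_bits: int, outlier_frac: float = 0.01):
    """Quantize each token vector to n_bits; keep top-|x| outliers unquantized."""
    # Pick outliers per token by absolute magnitude.
    n_outliers = max(1, int(outlier_frac * x.shape[-1]))
    outlier_idx = x.abs().topk(n_outliers, dim=-1).indices
    outlier_vals = torch.gather(x, -1, outlier_idx)

    # Zero outliers so they do not stretch the quantization range.
    x_body = x.scatter(-1, outlier_idx, 0.0)

    # Asymmetric min-max quantization per token.
    qmax = 2 ** n_bits - 1
    x_min = x_body.amin(dim=-1, keepdim=True)
    x_max = x_body.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = ((x_body - x_min) / scale).round().clamp(0, qmax)

    return q, scale, x_min, outlier_idx, outlier_vals


def dequantize(q, scale, x_min, outlier_idx, outlier_vals):
    """Reconstruct the cache and restore the full-precision outliers."""
    x_hat = q * scale + x_min
    return x_hat.scatter(-1, outlier_idx, outlier_vals)


if __name__ == "__main__":
    # Toy KV cache of shape (num_tokens, head_dim).
    keys = torch.randn(128, 64)
    values = torch.randn(128, 64)

    k_hat = dequantize(*quantize_per_token(keys, n_bits=4))
    v_hat = dequantize(*quantize_per_token(values, n_bits=2))
    print("key reconstruction MSE:", (keys - k_hat).pow(2).mean().item())
    print("value reconstruction MSE:", (values - v_hat).pow(2).mean().item())
```

In QAQ itself the bit allocation is derived from the distinct sensitivities of keys and values and an attention-aware criterion; the fixed 4-bit/2-bit split above is only a stand-in to show where that decision would plug in.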
