APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models (2402.14866v2)

Published 21 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving an average bitwidth of 4 with a perplexity of 5.22 on the C4 dataset, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracies of 68.24% and 70.48% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.
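
The two quantitative ingredients named in the abstract, a per-layer Hessian-trace sensitivity score and a mixed-precision bit allocation under an average bit-width budget, can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (hessian_trace, allocate_bits) are hypothetical, the Hutchinson probing and the two-level bit split are simplifications, and APTQ's attention-aware weighting and the per-layer second-order (GPTQ-style) quantization step are omitted.

```python
import torch

def hessian_trace(loss, params, n_samples=8):
    """Hutchinson estimate of tr(H) for one weight tensor.

    Sketch of a Hessian-trace sensitivity metric; APTQ additionally folds in
    the nonlinear effect of attention outputs, which is not modelled here.
    """
    # First-order gradient with the graph retained so a second derivative is possible.
    grad = torch.autograd.grad(loss, params, create_graph=True)[0]
    trace = 0.0
    for _ in range(n_samples):
        # Rademacher probe vector (+1 / -1 with equal probability).
        v = (torch.rand_like(params) < 0.5).to(params.dtype) * 2 - 1
        # Hessian-vector product via a second backward pass.
        hvp = torch.autograd.grad((grad * v).sum(), params, retain_graph=True)[0]
        trace += (v * hvp).sum().item()
    return trace / n_samples

def allocate_bits(traces, avg_budget=3.8, low=3, high=4):
    """Toy two-level allocation: layers with the largest Hessian trace keep
    `high` bits, the rest drop to `low`, subject to an average bit-width
    budget. Hypothetical helper, not the paper's exact algorithm."""
    n = len(traces)
    n_high = int(n * (avg_budget - low) / (high - low)) if high > low else n
    bits = [low] * n
    for i in sorted(range(n), key=lambda i: traces[i], reverse=True)[:n_high]:
        bits[i] = high
    return bits
```

In practice the trace would be estimated for every transformer layer on a small calibration set, and the resulting scores would drive the per-layer bit-width assignment before quantization.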
