
What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation (2403.06408v1)

Published 11 Mar 2024 in cs.LG and cs.AI

Abstract: Quantization has emerged as a promising technique for improving the memory and computational efficiency of LLMs. Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this approach "the lens of perturbation". Using this lens, we conduct experiments with various artificial perturbations to explore their impact on LLM performance. Our findings reveal several connections between the properties of perturbations and LLM performance, providing insights into the failure cases of uniform quantization and suggesting potential solutions to improve the robustness of LLM quantization. To demonstrate the significance of our findings, we implement a simple non-uniform quantization approach based on our insights. Our experiments show that this approach achieves minimal performance degradation on both 4-bit weight quantization and 8-bit quantization for weights and activations. These results validate the correctness of our approach and highlight its potential to improve the efficiency of LLMs without sacrificing performance.
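
The abstract's central idea, viewing quantization as a perturbation added to the weights and activations, can be made concrete with a short sketch. The snippet below is not the paper's code; the round-to-nearest uniform quantizer, the 4-bit setting, and the magnitude-matched random noise are illustrative assumptions showing how quantization error can be isolated as an additive term and compared against artificial perturbations.

```python
# A minimal sketch, not the paper's implementation: round-to-nearest uniform
# quantization of a weight matrix is expressed as an additive perturbation
# delta = Q(W) - W, alongside a magnitude-matched random perturbation for
# comparison. The 4-bit setting, tensor shape, and helper names are assumptions.
import torch


def uniform_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor round-to-nearest quantization, returned in float
    (quantize then dequantize, i.e. "fake quantization")."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale


def as_perturbation(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Quantization error expressed as an additive perturbation on the weights."""
    return uniform_quantize(w, n_bits) - w


if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(1024, 1024)              # stand-in for an LLM weight matrix
    delta = as_perturbation(w, n_bits=4)     # quantization viewed as W + delta
    # A random perturbation matched in magnitude: comparing how a model reacts
    # to `delta` versus `matched` is the kind of controlled experiment the
    # perturbation lens enables (is it the size or the structure of the noise?).
    matched = torch.randn_like(w) * delta.std()
    print(f"quantization perturbation std: {delta.std().item():.6f}")
    print(f"magnitude-matched random noise std: {matched.std().item():.6f}")
```

In an experiment along the lines the abstract describes, each such perturbation would be added to the weights of a pretrained LLM and the resulting change in perplexity or task accuracy measured, isolating which properties of the perturbation drive the performance drop.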
