Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (2309.02784v2)

Published 6 Sep 2023 in cs.LG, cs.AI, and cs.CL

Abstract: As the size of LLMs continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision while being cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and a channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same accuracy at 2-bit quantization as their float counterparts. Our simple and effective approach makes it more practical for real-world applications.
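
The abstract only sketches the mechanism at a high level: after post-training quantization, the normalization-layer weights of each block are updated so that the quantized activations track the float activations on generated calibration data, under a channel-wise distance constraint. The snippet below is a minimal illustrative sketch of that idea, not the authors' implementation; the module names (float_block, quant_block), the loss form, the optimizer, and the hyperparameters are all assumptions introduced here for illustration.

```python
# Illustrative sketch of the norm-tweaking idea (not the paper's code).
# Assumes float_block and quant_block are nn.Modules over the same
# hidden size, that quant_block simulates quantization with
# differentiable (fake-quant) ops, and that calib_inputs is an
# iterable of activation tensors shaped [batch, seq_len, hidden].

import torch


def channelwise_distance(x_float: torch.Tensor, x_quant: torch.Tensor) -> torch.Tensor:
    """Per-channel squared distance between float and quantized activations,
    averaged over channels (a stand-in for the paper's channel-wise
    distance constraint)."""
    diff = (x_float - x_quant).pow(2).mean(dim=(0, 1))  # [hidden]
    return diff.mean()


def tweak_norm_layers(float_block, quant_block, calib_inputs, n_steps=100, lr=1e-5):
    """Update only the normalization-layer parameters of quant_block so its
    outputs match the float block's outputs on calibration data.
    Only LayerNorm is handled here; RMSNorm modules would be collected
    the same way."""
    norm_params = [p for m in quant_block.modules()
                   if isinstance(m, torch.nn.LayerNorm)
                   for p in m.parameters()]

    # Freeze everything, then re-enable gradients for norm parameters only.
    for p in quant_block.parameters():
        p.requires_grad_(False)
    for p in norm_params:
        p.requires_grad_(True)

    opt = torch.optim.Adam(norm_params, lr=lr)
    for _ in range(n_steps):
        for x in calib_inputs:
            with torch.no_grad():
                target = float_block(x)   # reference float activations
            out = quant_block(x)          # quantized activations
            loss = channelwise_distance(target, out)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return quant_block
```

Because only the (tiny) normalization parameters are trained against a per-block activation-matching loss, a loop like this stays far cheaper than quantization-aware training while still reshaping the quantized activation distribution toward its float counterpart, which is the effect the abstract describes.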

References (40)
  1. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032.
  2. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7432–7439.
  3. Language models are few-shot learners. In Conference on Neural Information Processing Systems (NeurIPS).
  4. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  5. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv preprint arXiv:2205.14135.
  6. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv preprint arXiv:2208.07339.
  7. GLM: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360.
  8. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774.
  9. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
  10. A framework for few-shot language model evaluation.
  11. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks.
  12. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop.
  13. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR).
  14. The BigScience Corpus: A 1.6 TB Composite Multilingual Dataset.
  15. LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. In Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; and Scarlett, J., eds., Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 20336–20350. PMLR.
  16. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
  17. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arXiv preprint arXiv:2305.17888.
  18. LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv:2305.11627.
  19. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
  20. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
  21. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789.
  22. NVIDIA. 2023. FasterTransformer.
  23. OpenAI. 2023a. GPT-4 Technical Report. arXiv:2303.08774.
  24. OpenAI. 2023b. Introducing ChatGPT.
  25. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
  26. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1–67.
  27. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9): 99–106.
  28. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865.
  29. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  30. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS).
  31. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 38087–38099. PMLR.
  32. Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt. arXiv preprint arXiv:2305.11186.
  33. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv preprint arXiv:2206.01861.
  34. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. arXiv:2303.08302.
  35. RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv:2304.01089.
  36. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
  37. Root Mean Square Layer Normalization. arXiv:1910.07467.
  38. Lifting the Curse of Capacity Gap in Distilling Language Models. arXiv:2305.12129.
  39. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  40. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129.