AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (2306.00978v5)

Published 1 Jun 2023 in cs.CL

Abstract: LLMs have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.


Summary

  • The paper introduces AWQ, which uses activation-aware per-channel scaling to protect salient weights and reduce quantization error in low-bit settings.
  • The scaling is applied without mixed-precision storage, enabling efficient LLM compression with no backpropagation or retraining.
  • Experiments with the accompanying TinyChat framework demonstrate up to 3.9x speedup over the FP16 baseline on desktop and edge GPUs, showcasing practical deployment benefits.

Activation-aware Weight Quantization for LLM Compression and Acceleration

Introduction

The paper "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" addresses the scaling challenges associated with deploying LLMs on edge devices. By proposing a method called Activation-aware Weight Quantization (AWQ), the authors aim to mitigate the significant memory and computational demands of LLMs while preserving their performance characteristics. This method emphasizes the importance of selectively protecting salient weights during quantization without relying on backpropagation or substantial retraining.

Methodology

The approach centers on the observation that only a small fraction of the weight channels in an LLM (on the order of 1%) is critical for maintaining performance. AWQ identifies these salient channels from the activation distribution rather than from the weight magnitudes: channels that see large activations matter most. Per-channel scaling derived from these activation statistics then protects the salient weights and effectively reduces quantization error.
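As a minimal sketch (not the authors' code; tensor shapes and names are illustrative), the salient channels can be ranked by average activation magnitude collected from a small calibration set:

```python
import torch

def find_salient_channels(calib_x: torch.Tensor, top_frac: float = 0.01) -> torch.Tensor:
    """Rank input channels by mean absolute activation and return the top ~1%.

    calib_x: calibration activations of shape [num_tokens, in_features].
    AWQ-style methods treat the returned channels as the salient weight
    channels whose quantization error most affects the layer output.
    """
    channel_mag = calib_x.abs().mean(dim=0)                # [in_features]
    k = max(1, int(top_frac * channel_mag.numel()))
    return channel_mag.topk(k).indices

# Example: 512 calibration tokens entering a layer with 4096 input features.
salient = find_salient_channels(torch.randn(512, 4096))   # indices of ~40 channels
```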

Rather than keeping these weights in higher precision, which would require hardware-inefficient mixed-precision formats, AWQ protects them through an equivalent transformation: salient weight channels are scaled up before quantization and the inverse scale is folded into the preceding activations, leaving the layer's output mathematically unchanged. The per-channel scales are determined offline from activation statistics, so the method integrates with standard low-bit kernels and offers a straightforward path to efficient quantized LLMs.

Figure 1: Introduction of AWQ with implementation in TinyChat for deploying 4-bit quantized LLMs, achieving significant performance boosts.
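
A rough sketch of how such a scale could be chosen in practice, assuming simulated symmetric round-to-nearest INT4 quantization and a simple grid search over a single exponent alpha (the grid, normalization, and function names are illustrative assumptions, not the paper's exact implementation):

```python
import torch

def rtn_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Simulated symmetric round-to-nearest quantization, per output channel."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def search_channel_scales(w: torch.Tensor, calib_x: torch.Tensor, n_grid: int = 20) -> torch.Tensor:
    """Pick per-input-channel scales s = mean|x|^alpha that minimize output error.

    w: [out_features, in_features]; calib_x: [num_tokens, in_features].
    Scaling the weights by s and the activations by 1/s is an equivalent
    transformation, so only the quantization error of the scaled weights changes.
    """
    x_mag = calib_x.abs().mean(dim=0).clamp(min=1e-8)      # per-channel activation magnitude
    ref = calib_x @ w.t()                                   # full-precision layer output
    best_err, best_s = float("inf"), torch.ones_like(x_mag)
    for i in range(n_grid):
        alpha = i / n_grid
        s = x_mag.pow(alpha)
        s = s / (s.max() * s.min()).sqrt()                  # keep scales in a balanced range
        err = ((calib_x / s) @ rtn_quantize(w * s).t() - ref).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

Note that alpha = 0 recovers plain round-to-nearest quantization, so the search can only match or improve on that baseline for the calibration batch.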

Implementation with TinyChat

The paper introduces TinyChat, an inference framework tailored for deploying models quantized using the AWQ method. TinyChat efficiently transforms theoretical memory savings into tangible computational speedups by leveraging device-specific optimizations such as SIMD-aware weight packing and kernel fusion.

TinyChat delivers substantial speedups over existing systems such as the Hugging Face FP16 implementation, reaching up to 3.9x acceleration on devices like the NVIDIA RTX 4090 (desktop) and Jetson Orin (edge). It demonstrates AWQ in real-world scenarios, with compatibility across platforms and practical LLM deployment on constrained devices.
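
As a simplified illustration of weight packing (a naive pairwise layout, not TinyChat's actual platform-aware kernel format), two 4-bit values can be stored per byte so a kernel can dequantize them cheaply at runtime:

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of unsigned 4-bit values (0..15) into uint8, halving memory.

    q: integer tensor with an even last dimension, values already offset into
    [0, 15]. Real kernels reorder values to match GPU/SIMD lane layouts; this
    keeps a naive pairwise order for clarity.
    """
    q = q.to(torch.uint8)
    return q[..., 0::2] | (q[..., 1::2] << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the original sequence of 4-bit values."""
    lo, hi = packed & 0x0F, (packed >> 4) & 0x0F
    return torch.stack((lo, hi), dim=-1).reshape(*packed.shape[:-1], -1)

q = torch.randint(0, 16, (8, 4096))       # e.g. 4-bit codes for a few weight rows
assert torch.equal(unpack_int4(pack_int4(q)), q.to(torch.uint8))
```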

Experiments and Results

The experimental results substantiate AWQ's effectiveness across multiple benchmarks and model architectures, including language modeling tasks and instruction-tuned models. AWQ achieves lower perplexity than baseline methods such as round-to-nearest quantization (RTN) and GPTQ, particularly at very low bit-widths.

Figure 2: Identification and protection of 1% salient weights to improve quantized performance, following the activation-awareness principle.
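
For context, the RTN baseline referenced above quantizes each group of weights independently and uses no activation information at all; a minimal simulated version (the 4-bit, group-size-128 configuration is a common default, assumed here for illustration) looks like this:

```python
import torch

def rtn_group_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Simulated asymmetric round-to-nearest quantization with per-group scales.

    w: [out_features, in_features] with in_features divisible by group_size.
    Each group of consecutive weights shares one scale and zero-point; no
    calibration data is involved, which is why salient channels go unprotected.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min, w_max = g.amin(dim=-1, keepdim=True), g.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, qmax)
    return ((q - zero) * scale).reshape(out_f, in_f)
```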

Additionally, AWQ is robust to the size and distribution of the calibration data: because it only collects activation statistics rather than reconstructing layer outputs, it needs a smaller calibration set and is less prone to overfitting the calibration distribution, maintaining accuracy across diverse domains. This robustness underscores AWQ's potential as a reliable quantization method for a broad range of applications.

Implications and Future Work

The implications of AWQ are significant for on-device AI applications requiring efficient deployment of LLMs. The reduction in memory requirements and acceleration in inference time facilitated by AWQ and TinyChat open new avenues for privacy-preserving applications and those needing real-time performance.

Future work could extend AWQ to additional edge-device architectures, combine it with complementary quantization and compression techniques, and explore its application to an even broader range of model types, including multi-modal systems. Further refinement of the automated scaling strategy could also improve flexibility and performance across different hardware environments.

Conclusion

By presenting Activation-aware Weight Quantization, the authors provide a viable solution to the pressing issue of deploying LLMs on resource-constrained devices. Through a quantization strategy that preserves the weights most critical to accuracy and a versatile deployment system in TinyChat, the paper lays a foundation for future work on the efficient, practical deployment of AI models at the edge.
