AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (2306.00978v5)

Published 1 Jun 2023 in cs.CL

Abstract: LLMs have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.


Summary

  • The paper introduces AWQ, which uses activation-aware per-channel scaling to protect salient weights and reduce quantization error in low-bit settings.
  • The scaling is applied without mixed-precision storage, enabling efficient LLM compression with no backpropagation or retraining.
  • Experiments with the accompanying TinyChat framework demonstrate up to 3.9x speedup over the FP16 baseline on desktop and edge GPUs, showcasing practical deployment benefits.

Activation-aware Weight Quantization for LLM Compression and Acceleration

Introduction

The paper "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" addresses the scaling challenges associated with deploying LLMs on edge devices. By proposing a method called Activation-aware Weight Quantization (AWQ), the authors aim to mitigate the significant memory and computational demands of LLMs while preserving their performance characteristics. This method emphasizes the importance of selectively protecting salient weights during quantization without relying on backpropagation or substantial retraining.

Methodology

The approach centers on the observation that only a small fraction of the weight channels in an LLM (on the order of 1%) is critical for maintaining performance. AWQ identifies these salient channels from the activation distribution rather than from the weight magnitudes: channels that see large activations matter most. Per-channel scaling derived from these activation statistics then protects the salient weights and effectively reduces quantization error.
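As a minimal sketch (not the authors' code; tensor shapes and names are illustrative), the salient channels can be ranked by average activation magnitude collected from a small calibration set:

```python
import torch

def find_salient_channels(calib_x: torch.Tensor, top_frac: float = 0.01) -> torch.Tensor:
    """Rank input channels by mean absolute activation and return the top ~1%.

    calib_x: calibration activations of shape [num_tokens, in_features].
    AWQ-style methods treat the returned channels as the salient weight
    channels whose quantization error most affects the layer output.
    """
    channel_mag = calib_x.abs().mean(dim=0)                # [in_features]
    k = max(1, int(top_frac * channel_mag.numel()))
    return channel_mag.topk(k).indices

# Example: 512 calibration tokens entering a layer with 4096 input features.
salient = find_salient_channels(torch.randn(512, 4096))   # indices of ~40 channels
```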

Rather than keeping these weights in higher precision, which would require hardware-inefficient mixed-precision formats, AWQ protects them through an equivalent transformation: salient weight channels are scaled up before quantization and the inverse scale is folded into the preceding activations, leaving the layer's output mathematically unchanged. The per-channel scales are determined offline from activation statistics, so the method integrates with standard low-bit kernels and offers a straightforward path to efficient quantized LLMs.

Figure 1: Introduction of AWQ with implementation in TinyChat for deploying 4-bit quantized LLMs, achieving significant performance boosts.
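
A rough sketch of how such a scale could be chosen in practice, assuming simulated symmetric round-to-nearest INT4 quantization and a simple grid search over a single exponent alpha (the grid, normalization, and function names are illustrative assumptions, not the paper's exact implementation):

```python
import torch

def rtn_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Simulated symmetric round-to-nearest quantization, per output channel."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def search_channel_scales(w: torch.Tensor, calib_x: torch.Tensor, n_grid: int = 20) -> torch.Tensor:
    """Pick per-input-channel scales s = mean|x|^alpha that minimize output error.

    w: [out_features, in_features]; calib_x: [num_tokens, in_features].
    Scaling the weights by s and the activations by 1/s is an equivalent
    transformation, so only the quantization error of the scaled weights changes.
    """
    x_mag = calib_x.abs().mean(dim=0).clamp(min=1e-8)      # per-channel activation magnitude
    ref = calib_x @ w.t()                                   # full-precision layer output
    best_err, best_s = float("inf"), torch.ones_like(x_mag)
    for i in range(n_grid):
        alpha = i / n_grid
        s = x_mag.pow(alpha)
        s = s / (s.max() * s.min()).sqrt()                  # keep scales in a balanced range
        err = ((calib_x / s) @ rtn_quantize(w * s).t() - ref).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```

Note that alpha = 0 recovers plain round-to-nearest quantization, so the search can only match or improve on that baseline for the calibration batch.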

Implementation with TinyChat

The paper introduces TinyChat, an inference framework tailored for deploying models quantized using the AWQ method. TinyChat efficiently transforms theoretical memory savings into tangible computational speedups by leveraging device-specific optimizations such as SIMD-aware weight packing and kernel fusion.

TinyChat delivers substantial speedups over existing systems such as the Hugging Face FP16 implementation, reaching up to 3.9x acceleration on devices like the NVIDIA RTX 4090 (desktop) and Jetson Orin (edge). It demonstrates AWQ in real-world scenarios, with compatibility across platforms and practical LLM deployment on constrained devices.
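
As a simplified illustration of weight packing (a naive pairwise layout, not TinyChat's actual platform-aware kernel format), two 4-bit values can be stored per byte so a kernel can dequantize them cheaply at runtime:

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of unsigned 4-bit values (0..15) into uint8, halving memory.

    q: integer tensor with an even last dimension, values already offset into
    [0, 15]. Real kernels reorder values to match GPU/SIMD lane layouts; this
    keeps a naive pairwise order for clarity.
    """
    q = q.to(torch.uint8)
    return q[..., 0::2] | (q[..., 1::2] << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the original sequence of 4-bit values."""
    lo, hi = packed & 0x0F, (packed >> 4) & 0x0F
    return torch.stack((lo, hi), dim=-1).reshape(*packed.shape[:-1], -1)

q = torch.randint(0, 16, (8, 4096))       # e.g. 4-bit codes for a few weight rows
assert torch.equal(unpack_int4(pack_int4(q)), q.to(torch.uint8))
```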

Experiments and Results

The experimental results substantiate AWQ's effectiveness across multiple benchmarks and model architectures, including language modeling tasks and instruction-tuned models. AWQ achieves lower perplexity than baseline methods such as round-to-nearest quantization (RTN) and GPTQ, particularly at very low bit-widths.

Figure 2: Identification and protection of 1% salient weights to improve quantized performance, following the activation-awareness principle.
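
For context, the RTN baseline referenced above quantizes each group of weights independently and uses no activation information at all; a minimal simulated version (the 4-bit, group-size-128 configuration is a common default, assumed here for illustration) looks like this:

```python
import torch

def rtn_group_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Simulated asymmetric round-to-nearest quantization with per-group scales.

    w: [out_features, in_features] with in_features divisible by group_size.
    Each group of consecutive weights shares one scale and zero-point; no
    calibration data is involved, which is why salient channels go unprotected.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min, w_max = g.amin(dim=-1, keepdim=True), g.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, qmax)
    return ((q - zero) * scale).reshape(out_f, in_f)
```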

Additionally, AWQ is robust to the size and distribution of the calibration data: because it only collects activation statistics rather than reconstructing layer outputs, it needs a smaller calibration set and is less prone to overfitting the calibration distribution, maintaining accuracy across diverse domains. This robustness underscores AWQ's potential as a reliable quantization method for a broad range of applications.

Implications and Future Work

The implications of AWQ are significant for on-device AI applications requiring efficient deployment of LLMs. The reduction in memory requirements and acceleration in inference time facilitated by AWQ and TinyChat open new avenues for privacy-preserving applications and those needing real-time performance.

Future work could extend AWQ to additional edge-device architectures, combine it with complementary quantization and compression techniques, and explore its application to an even broader range of model types, including multi-modal systems. Further refinement of the automated scaling strategy could also improve flexibility and performance across different hardware environments.

Conclusion

By presenting Activation-aware Weight Quantization, the authors provide a viable solution to the pressing issue of deploying LLMs on resource-constrained devices. Through a quantization strategy that preserves the weights most critical to accuracy and a versatile deployment system in TinyChat, the paper lays a foundation for future work on the efficient, practical deployment of AI models at the edge.
