SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models (2405.14917v2)
Abstract: Post-training quantization (PTQ) is an effective technique for compressing LLMs. However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: \textbf{1)} \textit{Salience-Determined Bit Allocation} adaptively assigns bit-widths to the groups within each layer based on their salience; and \textbf{2)} \textit{Salience-Weighted Quantizer Calibration} optimizes quantizer parameters by incorporating element-level salience. Thanks to its structured group partitioning, SliM-LLM is hardware-friendly, matching the efficiency of uniform quantization methods while improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48\% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM$+$, which incorporates gradient-based quantization, further reduces perplexity by 35.1\%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM
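To make the group-wise allocation idea concrete, below is a minimal sketch of salience-driven bit assignment for one linear layer. It is not the authors' implementation: the salience proxy (activation-weighted squared weights), the `group_salience`/`allocate_bits` helpers, the 128-channel group size, and the fraction of groups promoted or demoted by one bit are all illustrative assumptions; only the overall scheme (score contiguous weight groups, then trade bits between low- and high-salience groups while keeping the average bit-width fixed) follows the abstract.

```python
import numpy as np

def group_salience(weights: np.ndarray, act_scale: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Per-group salience for one linear layer.

    weights:   (out_features, in_features) floating-point weight matrix
    act_scale: (in_features,) per-channel activation statistic (e.g. mean |x|)

    Salience here is an activation-weighted squared-weight proxy; the paper's
    exact metric may differ.
    """
    # Element-level salience: weight magnitude scaled by input activation magnitude.
    elem_salience = (weights ** 2) * (act_scale[None, :] ** 2)
    n_groups = weights.shape[1] // group_size
    # Aggregate salience over contiguous groups of input channels.
    return elem_salience.reshape(weights.shape[0], n_groups, group_size).sum(axis=(0, 2))

def allocate_bits(salience: np.ndarray, target_bits: int = 2) -> np.ndarray:
    """Assign {target-1, target, target+1} bits per group, keeping the mean at target.

    The most salient groups gain one bit, the least salient lose one bit;
    equal counts on both sides keep the overall budget unchanged.
    """
    n = len(salience)
    k = n // 4  # illustrative fraction; in practice this would be calibrated
    order = np.argsort(salience)            # ascending salience
    bits = np.full(n, target_bits, dtype=int)
    bits[order[:k]] = target_bits - 1       # low-salience groups: fewer bits
    bits[order[-k:]] = target_bits + 1      # high-salience groups: more bits
    return bits
```

Because the bit-widths vary only between contiguous groups, the layout stays structured, which is what lets such a scheme keep the dequantization and inference path as simple as uniform group-wise quantization.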