
Extreme Compression of Large Language Models via Additive Quantization

(arXiv:2401.06118)
Published Jan 11, 2024 in cs.LG and cs.CL

Abstract

The emergence of accurate open LLMs has led to a race towards quantization techniques that enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our work builds on top of Additive Quantization, a classic algorithm from the MCQ family, and adapts it to the quantization of language models. The resulting algorithm advances the state-of-the-art in LLM compression, outperforming all recently-proposed techniques in terms of accuracy at a given compression budget. For instance, when compressing Llama 2 models to 2 bits per parameter, our algorithm quantizes the 7B model to 6.93 perplexity (a 1.29 improvement relative to the best prior work, and 1.81 points from FP16), the 13B model to 5.70 perplexity (a 0.36 improvement) and the 70B model to 3.94 perplexity (a 0.22 improvement) on WikiText2. We release our implementation of Additive Quantization for Language Models (AQLM) as a baseline to facilitate future research in LLM quantization.
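As a rough aid for interpreting the bit budgets above, here is an illustrative sketch (my own bookkeeping, not code from the paper; the helper name and the example configuration are assumptions) of how multi-codebook quantization rates are counted: if each group of g weights is encoded by M indices into codebooks of 2^B entries, the index cost alone is M·B/g bits per weight, plus a small overhead for storing the codebooks.

```python
# Illustrative bookkeeping only; `bits_per_weight` is a hypothetical helper,
# not part of the AQLM codebase.
def bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int) -> float:
    """Index cost of multi-codebook quantization, ignoring codebook storage."""
    return num_codebooks * codebook_bits / group_size

# Example: one codebook with 2**16 entries applied to groups of 8 weights
# gives a nominal 16 / 8 = 2.0 bits per weight.
print(bits_per_weight(num_codebooks=1, codebook_bits=16, group_size=8))  # 2.0
```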

Overview

  • Introduces a novel approach for compressing LLMs using Additive Quantization (AQ) to maintain accuracy with smaller memory and computational requirements.

  • Describes a modified AQ method that minimizes the error in layer outputs rather than in the weights themselves, preserving accuracy even at 2 bits per parameter.

  • Presents results in which Additive Quantization for Language Models (AQLM) outperforms recently proposed quantization methods, especially at the lowest bit widths.

  • Demonstrates the effectiveness of AQLM using benchmarks that measure perplexity and zero-shot task accuracy.

  • Highlights AQLM's potential for deployment in resource-limited environments and its contribution to ongoing research in efficient LLM deployment.

Introduction

LLMs have seen significant advancement, attracting industrial and popular interest due to their accuracy and the potential to run locally on user devices. Compressing these models is vital for deployment on hardware with limited compute and memory. Quantization, the primary approach to post-training compression, reduces the bit-width of model parameters, shrinking the memory footprint and improving computational efficiency. However, aggressive quantization typically trades accuracy for compression. This paper presents a novel approach to LLM compression based on Additive Quantization (AQ), advancing the state of the art in maintaining accuracy under tight compression budgets.

Methodology

The paper details a modified version of AQ, a classic algorithm from the multi-codebook quantization (MCQ) family, adapted to compress LLM weights while preserving the functionality of the models. The new approach, named Additive Quantization for Language Models (AQLM), reformulates the standard AQ optimization problem to minimize the error in the LLM layer outputs rather than the weights themselves. By modifying the algorithm to be instance-aware and incorporating layer calibration, AQLM achieves a homogeneous and simple quantization format that maintains high accuracy even at extreme compression levels like 2 bits per parameter.
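To make the reformulation concrete, the following is a minimal PyTorch sketch for a single output row of one linear layer: a row of weights is represented as a sum of codewords drawn from several codebooks, and a code assignment is scored by the error it induces on the layer output over a calibration batch rather than on the weights themselves. The shapes, helper names (dequantize_row, output_error), and the direct scoring shown here are my own illustration under those assumptions; the paper's actual procedure searches over codes and alternates with codebook updates.

```python
import torch

def dequantize_row(codes, codebooks):
    """Reconstruct one weight row from additive codes.

    codes:     (num_groups, M) integer indices, one per codebook
    codebooks: (M, K, group_size) learned codewords
    Each group of `group_size` weights is the sum of its M selected codewords.
    """
    M = codebooks.shape[0]
    groups = sum(codebooks[m][codes[:, m]] for m in range(M))  # (num_groups, group_size)
    return groups.reshape(-1)                                   # (in_features,)

def output_error(X, w_row, codes, codebooks):
    """Calibration-aware objective: ||X w - X w_hat||^2 instead of ||w - w_hat||^2."""
    w_hat = dequantize_row(codes, codebooks)
    return torch.sum((X @ w_row - X @ w_hat) ** 2)

# Tiny usage example with random data (in_features = num_groups * group_size).
torch.manual_seed(0)
M, K, group_size, num_groups, n_calib = 2, 256, 8, 4, 32
codebooks = torch.randn(M, K, group_size)
codes = torch.randint(K, (num_groups, M))
w_row = torch.randn(num_groups * group_size)
X = torch.randn(n_calib, num_groups * group_size)
print(output_error(X, w_row, codes, codebooks))
```

The key design point the sketch captures is that the same reconstructed weights can score very differently depending on the calibration inputs X, which is what makes the method instance-aware.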

Results

In the results section, AQLM showcases superior performance when compressing LLMs of various sizes, with significant improvements over existing methods across a range of bit budgets. The paper reports extensive evaluations on the Llama 2 model family, measuring both WikiText2 perplexity and zero-shot task accuracy. Notably, the largest perplexity gains are recorded at the extreme low end of 2 bits per parameter.
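For readers who want to reproduce this kind of number, below is a hedged sketch of a standard WikiText2 perplexity measurement for a causal language model using Hugging Face transformers and datasets. The model name, 2048-token window, and non-overlapping stride are placeholder assumptions; the paper's exact evaluation settings may differ, and a quantized AQLM checkpoint would be loaded in place of the FP16 model shown here.

```python
# Sketch of WikiText2 perplexity evaluation; settings are assumptions, not the
# paper's exact protocol.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the test split into one long string and tokenize once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for start in range(0, ids.shape[1] - seq_len, seq_len):  # non-overlapping windows
    chunk = ids[:, start:start + seq_len].to(model.device)
    with torch.no_grad():
        # labels=chunk makes the model return the mean token NLL for this window
        loss = model(chunk, labels=chunk).loss
    nlls.append(loss.float())

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```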

Conclusion

AQLM stands as a significant contribution to the field of LLM quantization, showing that it is possible to maintain high accuracy even at low bit counts. It serves as a critical step toward making complex LLMs accessible within a more extensive range of environments, especially those with limited resources. The release of its implementation further supports ongoing research and development, providing a foundation for future exploration into efficient LLM deployment on consumer-grade devices. Further work aims to streamline AQLM's computational process and delve deeper into optimal parameter settings for model compression.
