
Extreme Compression of Large Language Models via Additive Quantization

(arXiv:2401.06118)
Published Jan 11, 2024 in cs.LG and cs.CL

Abstract

The emergence of accurate open LLMs has led to a race towards quantization techniques that enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our work builds on top of Additive Quantization, a classic algorithm from the MCQ family, and adapts it to the quantization of language models. The resulting algorithm advances the state-of-the-art in LLM compression, outperforming all recently-proposed techniques in terms of accuracy at a given compression budget. For instance, when compressing Llama 2 models to 2 bits per parameter, our algorithm quantizes the 7B model to 6.93 perplexity (a 1.29 improvement relative to the best prior work, and 1.81 points from FP16), the 13B model to 5.70 perplexity (a 0.36 improvement) and the 70B model to 3.94 perplexity (a 0.22 improvement) on WikiText2. We release our implementation of Additive Quantization for Language Models (AQLM) as a baseline to facilitate future research in LLM quantization.
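As a rough aid for interpreting the bit budgets above, here is an illustrative sketch (my own bookkeeping, not code from the paper; the helper name and the example configuration are assumptions) of how multi-codebook quantization rates are counted: if each group of g weights is encoded by M indices into codebooks of 2^B entries, the index cost alone is M·B/g bits per weight, plus a small overhead for storing the codebooks.

```python
# Illustrative bookkeeping only; `bits_per_weight` is a hypothetical helper,
# not part of the AQLM codebase.
def bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int) -> float:
    """Index cost of multi-codebook quantization, ignoring codebook storage."""
    return num_codebooks * codebook_bits / group_size

# Example: one codebook with 2**16 entries applied to groups of 8 weights
# gives a nominal 16 / 8 = 2.0 bits per weight.
print(bits_per_weight(num_codebooks=1, codebook_bits=16, group_size=8))  # 2.0
```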

Overview

  • Introduces a novel approach for compressing LLMs using Additive Quantization (AQ) to maintain accuracy with smaller memory and computational requirements.

  • Describes a modified AQ method that minimizes the error in layer outputs rather than in the weights themselves, preserving accuracy even at 2 bits per parameter.

  • Presents results in which Additive Quantization for Language Models (AQLM) outperforms recently proposed quantization methods, especially at the lowest bit widths.

  • Demonstrates the effectiveness of AQLM using benchmarks that measure perplexity and zero-shot task accuracy.

  • Highlights AQLM's potential for deployment in resource-limited environments and its contribution to ongoing research in efficient LLM deployment.

Introduction

LLMs have seen significant advancement, attracting industrial and popular interest due to their accuracy and the potential to run locally on user devices. Compressing these models is vital for deployment on hardware with limited compute and memory. Quantization, the primary approach to post-training compression, reduces the bit-width of model parameters, shrinking the memory footprint and improving computational efficiency. However, aggressive quantization typically trades accuracy for compression. This paper presents a novel approach to LLM compression based on Additive Quantization (AQ), advancing the state of the art in maintaining accuracy under tight compression budgets.

Methodology

The paper details a modified version of AQ, a classic algorithm from the multi-codebook quantization (MCQ) family, adapted to compress LLM weights while preserving the functionality of the models. The new approach, named Additive Quantization for Language Models (AQLM), reformulates the standard AQ optimization problem to minimize the error in the LLM layer outputs rather than the weights themselves. By modifying the algorithm to be instance-aware and incorporating layer calibration, AQLM achieves a homogeneous and simple quantization format that maintains high accuracy even at extreme compression levels like 2 bits per parameter.
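To make the reformulation concrete, the following is a minimal PyTorch sketch for a single output row of one linear layer: a row of weights is represented as a sum of codewords drawn from several codebooks, and a code assignment is scored by the error it induces on the layer output over a calibration batch rather than on the weights themselves. The shapes, helper names (dequantize_row, output_error), and the direct scoring shown here are my own illustration under those assumptions; the paper's actual procedure searches over codes and alternates with codebook updates.

```python
import torch

def dequantize_row(codes, codebooks):
    """Reconstruct one weight row from additive codes.

    codes:     (num_groups, M) integer indices, one per codebook
    codebooks: (M, K, group_size) learned codewords
    Each group of `group_size` weights is the sum of its M selected codewords.
    """
    M = codebooks.shape[0]
    groups = sum(codebooks[m][codes[:, m]] for m in range(M))  # (num_groups, group_size)
    return groups.reshape(-1)                                   # (in_features,)

def output_error(X, w_row, codes, codebooks):
    """Calibration-aware objective: ||X w - X w_hat||^2 instead of ||w - w_hat||^2."""
    w_hat = dequantize_row(codes, codebooks)
    return torch.sum((X @ w_row - X @ w_hat) ** 2)

# Tiny usage example with random data (in_features = num_groups * group_size).
torch.manual_seed(0)
M, K, group_size, num_groups, n_calib = 2, 256, 8, 4, 32
codebooks = torch.randn(M, K, group_size)
codes = torch.randint(K, (num_groups, M))
w_row = torch.randn(num_groups * group_size)
X = torch.randn(n_calib, num_groups * group_size)
print(output_error(X, w_row, codes, codebooks))
```

The key design point the sketch captures is that the same reconstructed weights can score very differently depending on the calibration inputs X, which is what makes the method instance-aware.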

Results

In the results section, AQLM showcases superior performance when compressing LLMs of various sizes, with significant improvements over existing methods across a range of bit budgets. The paper reports extensive evaluations on the Llama 2 model family, measuring both WikiText2 perplexity and zero-shot task accuracy. Notably, the largest perplexity gains are recorded at the extreme low end of 2 bits per parameter.
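For readers who want to reproduce this kind of number, below is a hedged sketch of a standard WikiText2 perplexity measurement for a causal language model using Hugging Face transformers and datasets. The model name, 2048-token window, and non-overlapping stride are placeholder assumptions; the paper's exact evaluation settings may differ, and a quantized AQLM checkpoint would be loaded in place of the FP16 model shown here.

```python
# Sketch of WikiText2 perplexity evaluation; settings are assumptions, not the
# paper's exact protocol.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the test split into one long string and tokenize once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for start in range(0, ids.shape[1] - seq_len, seq_len):  # non-overlapping windows
    chunk = ids[:, start:start + seq_len].to(model.device)
    with torch.no_grad():
        # labels=chunk makes the model return the mean token NLL for this window
        loss = model(chunk, labels=chunk).loss
    nlls.append(loss.float())

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```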

Conclusion

AQLM stands as a significant contribution to the field of LLM quantization, showing that it is possible to maintain high accuracy even at low bit counts. It serves as a critical step toward making complex LLMs accessible within a more extensive range of environments, especially those with limited resources. The release of its implementation further supports ongoing research and development, providing a foundation for future exploration into efficient LLM deployment on consumer-grade devices. Further work aims to streamline AQLM's computational process and delve deeper into optimal parameter settings for model compression.
