
CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

(2401.14109)
Published Jan 25, 2024 in cs.CL, cs.AI, cs.LG, and quant-ph

Abstract

LLMs such as ChatGPT and LlaMA are advancing rapidly in generative AI, but their immense size poses significant challenges, such as huge training and inference costs, substantial energy demands, and limitations for on-site deployment. Traditional compression methods such as pruning, distillation, and low-rank approximation focus on reducing the effective number of neurons in the network, while quantization focuses on reducing the numerical precision of individual weights to shrink the model while keeping the number of neurons fixed. While these compression methods have been relatively successful in practice, there is no compelling reason to believe that truncating the number of neurons is an optimal strategy. In this context, this paper introduces CompactifAI, an innovative LLM compression approach using quantum-inspired Tensor Networks that focuses on the model's correlation space instead, allowing for a more controlled, refined and interpretable model compression. Our method is versatile and can be implemented with - or on top of - other compression techniques. As a benchmark, we demonstrate that combining CompactifAI with quantization reduces the memory size of LlaMA 7B by 93% and the number of parameters by 70%, while accelerating training by 50% and inference by 25%, with only a small accuracy drop of 2% - 3%, going far beyond what is achievable today by other compression techniques. Our methods also allow for a refined layer sensitivity profiling, showing that deeper layers tend to be more suitable for tensor network compression, which is compatible with recent observations on the ineffectiveness of those layers for LLM performance. Our results imply that standard LLMs are, in fact, heavily overparametrized, and do not need to be large at all.

Overview

  • CompactifAI introduces a novel quantum-inspired Tensor Network method for compressing LLMs while retaining high accuracy.

  • Traditional compression techniques like pruning and quantization are augmented with CompactifAI to improve model efficiency.

  • Tensor Networks decompose weight matrices in neural networks, and controlled truncation of correlations allows significant size reduction.

  • The methodology was benchmarked on the LlaMA-2 7B model, reducing it to 30% of its float16 size while maintaining nearly 90% accuracy after retraining.

  • The method yields energy-efficient, more accessible LLMs suitable for on-premises deployment and could help democratize AI technologies.

Methodology

The focus of the study is on CompactifAI, a method considered novel due to its use of quantum-inspired Tensor Networks (TNs) for compressing LLMs. The paper outlines an innovative technique for compression that diverges from traditional methods such as pruning, distillation, quantization, and low-rank approximations, which typically truncate the number of effective neurons or reduce the numerical precision of weights.

The CompactifAI approach targets the correlation space within the model, favoring a more nuanced and controlled compression strategy. Versatile by design, it can augment existing compression techniques to drive further model efficiency. The authors demonstrate that even after massive compression, the model retains over 90% of its initial accuracy with a brief period of distributed retraining.

Implementation

The authors describe a compression pipeline in which the weight matrices of the network are decomposed into Tensor Networks such as Matrix Product Operators (MPOs). Correlations in the LLM's layers, specifically in the self-attention and multi-layer perceptron layers, are truncated by controlling the bond dimension of the TN. A crucial benefit of this method is its efficiency, with significantly reduced energy and memory requirements. The weight matrices of the LlaMA models are reshaped and decomposed accordingly, yielding a substantial reduction in parameter count. The paper then explains that retraining the tensorized model with distributed training restores near-original accuracy in the compressed version, emphasizing its suitability for LLM fine-tuning.
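For intuition, here is a minimal sketch, not the paper's actual implementation, of how a weight matrix can be reshaped and split into a two-core MPO via an SVD, with the bond dimension chi controlling how many correlations are kept. The function names, the 256x256 test matrix, and the two-core layout are illustrative assumptions; the paper applies analogous decompositions (with more cores, applied to the self-attention and MLP weights of LlaMA, followed by retraining).

```python
import numpy as np

def mpo_compress(W, m_dims, n_dims, chi):
    """Split a weight matrix W into a two-core MPO, truncated to bond dimension chi.

    m_dims = (m1, m2) with m1 * m2 == W.shape[0]
    n_dims = (n1, n2) with n1 * n2 == W.shape[1]
    """
    m1, m2 = m_dims
    n1, n2 = n_dims
    # Reshape the matrix into a 4-index tensor and group the (row, col) legs per site.
    T = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3)   # (m1, n1, m2, n2)
    M = T.reshape(m1 * n1, m2 * n2)
    # SVD and truncation of the smallest singular values (the weakest correlations).
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    chi = min(chi, S.size)
    A = (U[:, :chi] * S[:chi]).reshape(m1, n1, chi)        # first MPO core
    B = Vh[:chi, :].reshape(chi, m2, n2)                   # second MPO core
    return A, B

def mpo_to_matrix(A, B):
    """Contract the two MPO cores back into a dense matrix (for checking)."""
    m1, n1, chi = A.shape
    _, m2, n2 = B.shape
    T = np.einsum('abc,cde->adbe', A, B)                   # (m1, m2, n1, n2)
    return T.reshape(m1 * m2, n1 * n2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in for a layer weight
    A, B = mpo_compress(W, (16, 16), (16, 16), chi=32)
    W_approx = mpo_to_matrix(A, B)
    dense, compressed = W.size, A.size + B.size
    err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
    print(f"parameters: {dense} -> {compressed} ({compressed / dense:.1%})")
    print(f"relative reconstruction error: {err:.3f}")
```

For the purely random test matrix above, the truncation error is large; the premise of the paper is that trained LLM weights carry far less correlation than random noise, so the truncated model loses little and the remainder is recovered through a brief retraining stage.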

Results

Benchmarking of the CompactifAI methodology used the LlaMA-2 7B model from Meta's LlaMA series. The authors first applied quantization, halving the memory requirement by moving from float32 to float16, followed by a Tensor Network compression that reduced the model to 30% of its float16 size. Notably, after additional retraining on text summarization tasks using the XSum and Gigaword datasets, the compressed model recovered nearly 90% of the original model's accuracy.
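As a rough sanity check on these numbers, the back-of-the-envelope sketch below estimates the memory footprint at each stage, assuming the nominal 7B parameter count and reading "30% of its size in float16" as the tensorized model occupying 30% of the float16 footprint. The paper's exact accounting (including the headline 93% reduction, which combines further compression steps) may differ.

```python
# Rough memory-footprint estimate for the pipeline described above.
params = 7e9        # nominal parameter count of LlaMA-2 7B (assumption)
gib = 2**30

float32_size = params * 4 / gib          # 4 bytes per weight
float16_size = params * 2 / gib          # quantization halves the footprint
tensorized_size = 0.30 * float16_size    # TN compression to ~30% of float16 size

print(f"float32:    {float32_size:5.1f} GiB")
print(f"float16:    {float16_size:5.1f} GiB")
print(f"float16+TN: {tensorized_size:5.1f} GiB")
```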

Conclusions & Prospects

The CompactifAI method represents a significant step toward energy-efficient and more accessible LLMs. It allows for profound reductions in model size with minimal accuracy loss, offering a more refined alternative to existing compression techniques. This work potentially paves the way for on-premises deployment of LLMs, expanding their application to settings that cannot rely on cloud connectivity. Its compatibility with other compression methods further strengthens the case for CompactifAI as a versatile and potent tool in AI development, one that could help democratize AI technologies and mitigate their environmental footprint.
