
CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

(2401.14109)
Published Jan 25, 2024 in cs.CL, cs.AI, cs.LG, and quant-ph

Abstract

LLMs such as ChatGPT and LlaMA are advancing rapidly in generative AI, but their immense size poses significant challenges, such as huge training and inference costs, substantial energy demands, and limitations for on-site deployment. Traditional compression methods such as pruning, distillation, and low-rank approximation focus on reducing the effective number of neurons in the network, while quantization focuses on reducing the numerical precision of individual weights to shrink the model while keeping the number of neurons fixed. While these compression methods have been relatively successful in practice, there is no compelling reason to believe that truncating the number of neurons is an optimal strategy. In this context, this paper introduces CompactifAI, an innovative LLM compression approach using quantum-inspired Tensor Networks that focuses on the model's correlation space instead, allowing for a more controlled, refined and interpretable model compression. Our method is versatile and can be implemented with - or on top of - other compression techniques. As a benchmark, we demonstrate that combining CompactifAI with quantization reduces the memory size of LlaMA 7B by 93% and the number of parameters by 70%, while accelerating training by 50% and inference by 25%, with only a small accuracy drop of 2% - 3%, going far beyond what is achievable today by other compression techniques. Our methods also allow for a refined layer sensitivity profiling, showing that deeper layers tend to be more suitable for tensor network compression, which is compatible with recent observations on the ineffectiveness of those layers for LLM performance. Our results imply that standard LLMs are, in fact, heavily overparametrized, and do not need to be large at all.

Overview

  • CompactifAI introduces a novel quantum-inspired Tensor Network method for compressing LLMs while retaining high accuracy.

  • Traditional compression techniques like pruning and quantization are augmented with CompactifAI to improve model efficiency.

  • Tensor Networks decompose weight matrices in neural networks, and controlled truncation of correlations allows significant size reduction.

  • The methodology was benchmarked on the LlaMA-2 7B model, reducing it to 30% of its float16 size while maintaining nearly 90% accuracy after retraining.

  • The method yields energy-efficient, more accessible LLMs suitable for on-premises deployment and could help democratize AI technologies.

Methodology

The focus of the study is on CompactifAI, a method considered novel due to its use of quantum-inspired Tensor Networks (TNs) for compressing LLMs. The paper outlines an innovative technique for compression that diverges from traditional methods such as pruning, distillation, quantization, and low-rank approximations, which typically truncate the number of effective neurons or reduce the numerical precision of weights.

The CompactifAI approach targets the correlation space within the model, favoring a more nuanced and controlled compression strategy. Versatile by design, it can augment existing compression techniques to drive further model efficiency. The authors demonstrate that even after massive compression, the model retains over 90% of its initial accuracy with a brief period of distributed retraining.

Implementation

The authors describe a compression pipeline in which the weight matrices of the network are decomposed into Tensor Networks such as Matrix Product Operators (MPOs). Correlations in the LLM's layers, specifically in the self-attention and multi-layer perceptron layers, are truncated by controlling the bond dimension of the TN. A crucial benefit of this method is its efficiency, with significantly reduced energy and memory requirements. The weight matrices of the LlaMA models are reshaped and decomposed accordingly, yielding a substantial reduction in parameter count. The paper then explains that retraining the tensorized model with distributed training restores near-original accuracy in the compressed version, emphasizing its suitability for LLM fine-tuning.
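For intuition, here is a minimal sketch, not the paper's actual implementation, of how a weight matrix can be reshaped and split into a two-core MPO via an SVD, with the bond dimension chi controlling how many correlations are kept. The function names, the 256x256 test matrix, and the two-core layout are illustrative assumptions; the paper applies analogous decompositions (with more cores, applied to the self-attention and MLP weights of LlaMA, followed by retraining).

```python
import numpy as np

def mpo_compress(W, m_dims, n_dims, chi):
    """Split a weight matrix W into a two-core MPO, truncated to bond dimension chi.

    m_dims = (m1, m2) with m1 * m2 == W.shape[0]
    n_dims = (n1, n2) with n1 * n2 == W.shape[1]
    """
    m1, m2 = m_dims
    n1, n2 = n_dims
    # Reshape the matrix into a 4-index tensor and group the (row, col) legs per site.
    T = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3)   # (m1, n1, m2, n2)
    M = T.reshape(m1 * n1, m2 * n2)
    # SVD and truncation of the smallest singular values (the weakest correlations).
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    chi = min(chi, S.size)
    A = (U[:, :chi] * S[:chi]).reshape(m1, n1, chi)        # first MPO core
    B = Vh[:chi, :].reshape(chi, m2, n2)                   # second MPO core
    return A, B

def mpo_to_matrix(A, B):
    """Contract the two MPO cores back into a dense matrix (for checking)."""
    m1, n1, chi = A.shape
    _, m2, n2 = B.shape
    T = np.einsum('abc,cde->adbe', A, B)                   # (m1, m2, n1, n2)
    return T.reshape(m1 * m2, n1 * n2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in for a layer weight
    A, B = mpo_compress(W, (16, 16), (16, 16), chi=32)
    W_approx = mpo_to_matrix(A, B)
    dense, compressed = W.size, A.size + B.size
    err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
    print(f"parameters: {dense} -> {compressed} ({compressed / dense:.1%})")
    print(f"relative reconstruction error: {err:.3f}")
```

For the purely random test matrix above, the truncation error is large; the premise of the paper is that trained LLM weights carry far less correlation than random noise, so the truncated model loses little and the remainder is recovered through a brief retraining stage.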

Results

Benchmarking of the CompactifAI methodology used the LlaMA-2 7B model from Meta's LlaMA series. The authors first applied quantization, halving the memory requirement by moving from float32 to float16, followed by a Tensor Network compression that reduced the model to 30% of its float16 size. Notably, after additional retraining on text summarization tasks using the XSum and Gigaword datasets, the compressed model recovered nearly 90% of the original model's accuracy.
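As a rough sanity check on these numbers, the back-of-the-envelope sketch below estimates the memory footprint at each stage, assuming the nominal 7B parameter count and reading "30% of its size in float16" as the tensorized model occupying 30% of the float16 footprint. The paper's exact accounting (including the headline 93% reduction, which combines further compression steps) may differ.

```python
# Rough memory-footprint estimate for the pipeline described above.
params = 7e9        # nominal parameter count of LlaMA-2 7B (assumption)
gib = 2**30

float32_size = params * 4 / gib          # 4 bytes per weight
float16_size = params * 2 / gib          # quantization halves the footprint
tensorized_size = 0.30 * float16_size    # TN compression to ~30% of float16 size

print(f"float32:    {float32_size:5.1f} GiB")
print(f"float16:    {float16_size:5.1f} GiB")
print(f"float16+TN: {tensorized_size:5.1f} GiB")
```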

Conclusions & Prospects

The CompactifAI method represents a significant step toward energy-efficient and more accessible LLMs. It allows for profound reductions in model size with minimal accuracy loss, offering a more refined alternative to existing compression techniques. This work potentially paves the way for on-premises deployment of LLMs, expanding their application to settings that cannot rely on cloud connectivity. Its compatibility with other compression methods further strengthens the case for CompactifAI as a versatile and potent tool in AI development, one that could help democratize AI technologies and mitigate their environmental footprint.
