
SliceGPT: Compress Large Language Models by Deleting Rows and Columns

(2401.15024)
Published Jan 26, 2024 in cs.LG and cs.CL

Abstract

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression

Overview

  • The paper introduces SliceGPT, a method for reducing the size of LLMs post-training while maintaining performance.

  • SliceGPT uses an approach rooted in computational invariance and Principal Component Analysis (PCA) to 'slice' the model by removing less significant components.

  • Traditional sparsification methods like distillation and pruning often require Recovery Fine-Tuning (RFT), which is impractical for LLMs; SliceGPT avoids this necessity.

  • Experiments with OPT and LLAMA-2 models show that up to 30% of the model can be sliced away while preserving over 90% of zero-shot task performance.

  • SliceGPT models require no additional software optimization for deployment and can run effectively on consumer-grade hardware.

Introduction

The increasing reliance on LLMs in natural language processing has driven a surge in computational and memory demands. Addressing this issue, the paper introduces SliceGPT, a post-training sparsification approach that preserves the bulk of a model's performance while considerably reducing its size.

Sparsification Strategies

Traditional sparsification methods adopt strategies such as distillation or pruning to reduce model sizes. Pruning techniques in particular have attracted attention for their ability to set certain weight matrix elements to zero, hoping to bypass some floating point operations and thereby accelerate computation. However, these methods have limitations, notably requiring Recovery Fine-Tuning (RFT) to maintain performance, which becomes impractical with LLMs due to their size. SliceGPT circumvents this by proposing a single post-training transformation based on the concept of computational invariance.
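The constrained-speedup point can be made concrete with a minimal, illustrative sketch of unstructured magnitude pruning in NumPy (not the paper's method; the matrix size and 50% ratio are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))  # a stand-in weight matrix

# Unstructured magnitude pruning: zero out the smallest 50% of
# weights by absolute value.
threshold = np.quantile(np.abs(W), 0.5)
W_pruned = np.where(np.abs(W) < threshold, 0.0, W)

# The matrix is now ~50% sparse, but it is still stored and
# multiplied as a dense (512, 512) array: without sparse kernels
# or extra index structures, a matmul costs the same as before.
sparsity = np.mean(W_pruned == 0.0)
print(f"sparsity: {sparsity:.2%}, shape: {W_pruned.shape}")
```

By contrast, SliceGPT deletes whole rows and columns, so the stored matrix genuinely shrinks and standard dense kernels run faster with no extra machinery.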

Computational Invariance and SliceGPT Methodology

At the heart of SliceGPT lies the idea of computational invariance within transformer networks: an orthogonal transformation can be applied to the signal passing between transformer blocks and absorbed into the adjacent weight matrices without changing the network's output. Exploiting this invariance, the authors use Principal Component Analysis (PCA) to project transformer signals onto their principal components and remove the least significant ones, effectively "slicing" the network while keeping its predictive capabilities almost intact.
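The two ingredients can be sketched numerically. This NumPy toy (dimensions and the synthetic activation distribution are illustrative assumptions, not the paper's setup) first checks the invariance identity (XQ)(QᵀW) = XW for an orthogonal Q, then picks Q via PCA of the activations and truncates it, leaving smaller dense matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 64, 48, 256              # hidden dim, sliced dim, num samples

# Synthetic activations with decaying per-direction energy, so a
# low-dimensional projection can capture most of the signal.
X = rng.normal(size=(n, d)) * (0.8 ** np.arange(d))
W = rng.normal(size=(d, d))        # a stand-in weight matrix

# Computational invariance: for any orthogonal Q,
# (X Q)(Q^T W) == X W, so Q can be folded into adjacent weights.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
assert np.allclose((X @ Q) @ (Q.T @ W), X @ W)

# PCA-style choice of Q: eigenvectors of the activation covariance,
# ordered so early columns carry the most signal energy.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
Q = eigvecs[:, ::-1]               # sort by decreasing eigenvalue

# Slice: keep only the top-k principal directions. Both factors
# become smaller *dense* matrices -- no sparse kernels needed.
X_small = X @ Q[:, :k]             # (n, k)
W_small = Q[:, :k].T @ W           # (k, d)

approx = X_small @ W_small         # approximation of X @ W
rel_err = np.linalg.norm(approx - X @ W) / np.linalg.norm(X @ W)
print(f"relative error after slicing {d}->{k}: {rel_err:.2e}")
```

In the real method the truncated Q is folded into the surrounding weight matrices once, offline, so at inference time only the smaller dense matrices remain.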

Experimental Insights and Findings

SliceGPT's efficacy is demonstrated through experiments on various LLMs, including OPT and LLAMA-2 models. Up to 30% of these models can be sliced while preserving more than 90% of their original zero-shot task performance. The sliced models require fewer computational resources while maintaining competitive perplexity relative to their dense counterparts. Crucially, they require no additional software optimization to achieve these results, making them readily deployable on consumer-grade hardware.

Conclusion

SliceGPT advances the practical application of large-scale transformer models by mitigating resource constraints without sacrificing significant performance. The authors' findings hold substantial promise for future research in large-scale neural networks, providing a feasible path toward reducing inference costs and democratizing access to powerful NLP tools. The work also opens new avenues of research into other forms of LLM compression, such as structural pruning and quantization, while inviting further exploration into the realm of transformer network invariances.
