SliceGPT: Compress Large Language Models by Deleting Rows and Columns (2401.15024v2)

Published 26 Jan 2024 in cs.LG and cs.CL

Abstract: LLMs have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression

Authors (5)
  1. Saleh Ashkboos (20 papers)
  2. Maximilian L. Croci (5 papers)
  3. Marcelo Gennari do Nascimento (2 papers)
  4. Torsten Hoefler (203 papers)
  5. James Hensman (46 papers)
Citations (94)

Summary

  • The paper introduces a post-training sparsification method using PCA to delete rows and columns while preserving over 90% of LLM performance.
  • It leverages computational invariance in transformer networks to remove up to 30% of parameters with a single transformation.
  • Experimental results on models like OPT and LLAMA-2 show reduced computational resources and maintained zero-shot effectiveness.

Introduction

The increasing reliance on LLMs in natural language processing has driven a surge in computational and memory demands. Addressing this issue, the paper introduces SliceGPT, a novel post-training sparsification approach that preserves the bulk of a model's performance while considerably reducing its size.

Sparsification Strategies

Traditional sparsification methods rely on strategies such as distillation or pruning to reduce model size. Pruning in particular has attracted attention: it sets selected weight-matrix elements to zero in the hope of skipping the corresponding floating-point operations, but the resulting unstructured sparsity requires additional data structures and offers only constrained speedup on current hardware. Such methods also typically need Recovery Fine-Tuning (RFT) to maintain performance, which becomes impractical at LLM scale. SliceGPT circumvents both issues with a single post-training transformation based on the concept of computational invariance.
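To make that contrast concrete, here is a minimal sketch (ours, not from the paper; the shapes and the 25% ratio are purely illustrative) of why zeroing individual weights does not by itself reduce the cost of a dense matrix multiply, whereas deleting whole rows leaves a genuinely smaller dense matrix that standard kernels handle with fewer operations:

```python
import torch

d = 4096
W = torch.randn(d, d)
x = torch.randn(d)

# Unstructured pruning: 25% of the entries are zeroed, but the matrix keeps
# its full shape, so a standard dense kernel still performs d*d multiply-adds
# unless sparse data structures and custom kernels are added.
mask = torch.rand(d, d) > 0.25
W_pruned = W * mask
y_pruned = W_pruned @ x          # same cost as the dense model

# Structured slicing (illustrative): deleting 25% of the rows leaves a smaller
# dense matrix, so the very same kernel does only 0.75 * d * d multiply-adds.
keep = int(0.75 * d)
W_sliced = W[:keep, :]
y_sliced = W_sliced @ x          # smaller dense matmul, no special kernels
```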

Computational Invariance and SliceGPT Methodology

At the heart of SliceGPT lies the idea of computational invariance within transformer networks: an orthogonal transformation can be applied to the signal passing between transformer blocks and absorbed into the adjacent weight matrices without changing the model's output. By exploiting this invariance, the authors project the transformer's signals onto their principal components using Principal Component Analysis (PCA) and remove the least significant components, effectively "slicing" the network while keeping its predictive capabilities almost intact.
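The sketch below illustrates the two ingredients under simplifying assumptions: orthogonal rotations commute with RMSNorm-style normalization, so they can be folded into neighbouring weight matrices, and a PCA of calibration activations supplies a truncated rotation whose application deletes rows and columns of those matrices. The function names, the centering step, and the single-rotation setup are our own simplifications; the paper's per-block rotations and its treatment of attention, biases, and embeddings are omitted.

```python
import torch

def rmsnorm(X, eps=1e-6):
    return X / X.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()

# Computational invariance: for orthogonal Q, RMSNorm(X @ Q) == RMSNorm(X) @ Q,
# so Q can be absorbed into the weights on either side of the normalization.
d = 64
X = torch.randn(32, d, dtype=torch.float64)          # toy residual-stream activations
Q, _ = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64))
assert torch.allclose(rmsnorm(X @ Q), rmsnorm(X) @ Q, atol=1e-8)

def pca_rotation(X_calib, keep_dim):
    """PCA over calibration activations of the residual stream.
    Returns the top-`keep_dim` principal directions as a (d, keep_dim) matrix."""
    Xc = X_calib - X_calib.mean(dim=0, keepdim=True)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(cov)          # ascending eigenvalues
    order = torch.argsort(eigvals, descending=True)
    return eigvecs[:, order][:, :keep_dim]             # truncated rotation Q_D

def slice_weights(W_write, W_read, Q_D):
    """Fold the truncated rotation into the matrices that write to and read
    from the residual stream; the embedding dimension shrinks to keep_dim."""
    return W_write @ Q_D, Q_D.T @ W_read               # delete columns / rows

# Illustrative usage: slice 25% of the embedding dimension.
Q_D = pca_rotation(X, keep_dim=int(0.75 * d))
W_write = torch.randn(4 * d, d, dtype=torch.float64)   # writes into the residual stream
W_read  = torch.randn(d, 4 * d, dtype=torch.float64)   # reads from the residual stream
W_write_sliced, W_read_sliced = slice_weights(W_write, W_read, Q_D)
```

Multiplying by the truncated rotation is what turns the orthogonal transformation into an actual deletion of rows and columns, which is where the parameter and compute savings come from.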

Experimental Insights and Findings

SliceGPT's efficacy is demonstrated through experiments on various LLMs, including OPT and LLAMA-2 models. Up to 30% of these models can be sliced away while preserving more than 90% of their original zero-shot task performance. The sliced models require fewer computational resources and match, and in some cases improve on, the perplexity of their dense counterparts. Crucially, they need no additional code optimization to achieve these gains, making them readily deployable on consumer-grade hardware.

Conclusion

SliceGPT advances the practical application of large-scale transformer models by mitigating resource constraints without sacrificing significant performance. The authors' findings hold substantial promise for future research in large-scale neural networks, providing a feasible path toward reducing inference costs and democratizing access to powerful NLP tools. The work also opens new avenues of research into other forms of LLM compression, such as structural pruning and quantization, while inviting further exploration into the field of transformer network invariances.
