A Comprehensive Survey of Compression Algorithms for Language Models

(2401.15347)
Published Jan 27, 2024 in cs.CL and cs.AI

Abstract

How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is growing rapidly, driven by the desire to benefit from the remarkable advances of recent language models while avoiding the side effects of their gigantic size, such as increased carbon emissions and expensive maintenance fees. Although numerous compression algorithms have shown remarkable progress in compressing language models, the sheer number of algorithms ironically makes it challenging to capture emerging trends and to identify the fundamental concepts underlying them. In this paper, we survey and summarize diverse compression algorithms, including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. We not only summarize the overall trend of these compression algorithms but also select representative algorithms and analyze them in depth. We discuss the value of each category of compression algorithms and the desired properties of low-cost compression algorithms, which have become especially important with the emergence of LLMs. Finally, we introduce promising future research topics based on our survey results.

Overview

  • The paper provides a detailed overview of algorithms designed to compress language models effectively.

  • It discusses various compression techniques including pruning, quantization, knowledge distillation, and others, evaluating their performance and impact on model accuracy.

  • SparseGPT, OPTQ, and Low-Rank Adaptation (LoRA) are highlighted as notable algorithms for pruning, quantization, and fine-tuning, respectively.

  • Successful compression algorithms should directly incorporate task-specific objective functions and utilize an iterative compression process.

  • The paper suggests that combining different compression techniques could lead to significant compression rates for LLMs and facilitate wider accessibility.

Overview

The landscape of language model (LM) compression is vast, with an array of algorithms vying to reduce the size and computational demands of these models without compromising their accuracy. This paper presents a comprehensive overview of such algorithms, including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. The analysis explores the intricacies of each approach, evaluates their performance, and compares their effectiveness. Distinguishing between high-cost and low-cost approaches, the paper also underscores the critical attributes that successful LM compression algorithms should possess.

Representative Compression Algorithms

Among the several algorithms surveyed, a few stand out for their contribution to the field. SparseGPT makes significant strides in pruning methodology, successfully handling LLMs and extending its pruning technique to semi-structured sparsity patterns such as 2:4 sparsity. The algorithm selects and updates weights using strategies derived from optimal brain surgeon (OBS) and notably curtails the computational cost of the required Hessian inversion.
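
To make the 2:4 pattern concrete, the sketch below zeroes the two smallest-magnitude weights in every group of four along the input dimension. It uses simple magnitude-based selection purely for illustration; SparseGPT itself chooses and updates weights with Hessian-based, OBS-style saliency, and the function name prune_2_4 is hypothetical.

```python
import numpy as np

def prune_2_4(weight: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude entries in every group of 4 along
    the input dimension (the 2:4 semi-structured sparsity pattern)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "input dimension must be a multiple of 4"
    groups = weight.copy().reshape(rows, cols // 4, 4)
    # indices of the 2 smallest |w| within each group of 4
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

W = np.random.randn(8, 16)
W_24 = prune_2_4(W)
assert ((W_24.reshape(8, -1, 4) != 0).sum(axis=-1) <= 2).all()
```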

In quantization, OPTQ emerges as a potent tool for compressing the colossal parameter matrices of LLMs. The key to OPTQ's success is that it goes beyond naive round-to-nearest quantization: after each weight is quantized, the remaining unquantized weights are adjusted to compensate for the induced error, mitigating the loss in precision. Its strength is further bolstered by subsequent works that refine its approach to minimize accuracy loss, especially when dealing with activations.
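
The sketch below contrasts plain round-to-nearest quantization with a greatly simplified version of this compensation step for a single weight row, using a damped Hessian proxy built from toy calibration data. It is an unblocked approximation for illustration only; the actual OPTQ/GPTQ algorithm reuses a Cholesky factorization and processes columns in blocks, and all names and shapes here are assumptions.

```python
import numpy as np

def rtn_quantize(w, scale):
    """Symmetric round-to-nearest onto a 4-bit grid."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def quantize_with_compensation(w, H_inv, scale):
    """Quantize one weight row column by column; after each column,
    spread its quantization error onto the not-yet-quantized columns.
    A simplified, unblocked stand-in for the OPTQ/GPTQ update rule."""
    w = w.copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = rtn_quantize(w[j], scale)
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]   # error compensation
    return q

# Toy calibration data and one weight row (shapes chosen for illustration).
X = np.random.randn(128, 16)                          # calibration activations
H_inv = np.linalg.inv(X.T @ X + 1e-2 * np.eye(16))    # damped Hessian proxy
w = np.random.randn(16)
scale = np.abs(w).max() / 7
q_naive = rtn_quantize(w, scale)
q_comp = quantize_with_compensation(w, H_inv, scale)
```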

Low-Rank Adaptation (LoRA) is identified as a pivotal method for fine-tuning LMs while updating only a small number of parameters, thereby reducing the memory overhead traditionally associated with fine-tuning large models. By freezing the pre-trained weights and training only a pair of low-rank matrices added to them, LoRA sharply reduces the gradients and optimizer states that must be stored, marking it as essential for adapting LMs at low cost.
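
A minimal sketch of the LoRA parameterization for a plain linear layer is shown below: the pre-trained weight W stays frozen, and only the low-rank factors A and B (scaled by alpha/r) would receive gradients during fine-tuning. The class and argument names are illustrative, not from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A."""
    def __init__(self, W: np.ndarray, rank: int = 8, alpha: float = 16.0):
        d_out, d_in = W.shape
        self.W = W                                    # frozen pre-trained weight
        self.A = np.random.randn(rank, d_in) * 0.01   # trainable, small init
        self.B = np.zeros((d_out, rank))              # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # y = x W^T + (alpha/r) * x A^T B^T ; only A and B would be trained
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(np.random.randn(32, 64), rank=4)
y = layer.forward(np.random.randn(10, 64))   # shape (10, 32)
```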

Desired Properties

The paper highlights two critical properties that successful low-cost LM compression algorithms must possess. First, direct incorporation of task-specific objective functions is vital; proxy objectives such as local layer-wise reconstruction error can lead to suboptimal results. Second, an iterative compression process proves advantageous in mitigating errors at each iteration, thereby preserving the knowledge acquired during pre-training.
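
The toy example below illustrates both properties on a small linear model: the pruning schedule is iterative, and the recovery steps follow the gradient of the task loss itself rather than a layer-wise reconstruction proxy. It is a self-contained sketch under these assumptions, not an algorithm from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
w_true = rng.normal(size=32) * (rng.random(32) < 0.3)   # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=256)

w = rng.normal(size=32) * 0.1
mask = np.ones(32)

def task_loss_grad(w):
    # gradient of the task objective (mean squared error),
    # not a layer-wise reconstruction proxy
    return 2 * X.T @ (X @ w - y) / len(y)

# iterative schedule: prune a little, then recover with task-loss steps
for target_sparsity in (0.25, 0.5, 0.75):
    k = int(target_sparsity * w.size)
    mask[np.argsort(np.abs(w))[:k]] = 0.0     # drop the smallest weights so far
    w *= mask
    for _ in range(200):                      # brief recovery phase
        w -= 0.05 * task_loss_grad(w) * mask  # only surviving weights update

print("final sparsity:", 1 - mask.mean(), "loss:", np.mean((X @ w - y) ** 2))
```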

Future Research

Looking ahead, several promising research directions are identified. The quest for efficient iterative algorithms that further enhance the accuracy of compressed models remains critical, especially for LLMs, where traditional retraining is resource-prohibitive. Effective strategies for directly optimizing the target objective function, quantizing the activations of LLMs, and unifying diverse compression algorithms pave the way for future innovation. The fusion of parameter-efficient fine-tuning (PEFT) with traditional high-cost algorithms holds particular promise for reducing the cost of fine-tuning while maintaining accuracy.

Conclusion

The survey concludes that the amalgamation of various compression techniques could lead to unprecedented compression rates for language models, particularly for the increasingly relevant LLMs. The findings and discussions encapsulated in this paper aim to steer future developments in the field, promoting both cost-effective and performance-optimized compression avenues. The ultimate aim is to democratize access to advanced AI capabilities by making LLMs more resource-efficient and, thus, more widely deployable.
