A Comprehensive Survey of Compression Algorithms for Language Models

(2401.15347)
Published Jan 27, 2024 in cs.CL and cs.AI

Abstract

How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is growing rapidly, driven by the desire to benefit from the remarkable advances of recent language models while avoiding the side effects of their gigantic size, such as increased carbon emissions and expensive maintenance fees. Although numerous compression algorithms have shown remarkable progress in compressing language models, the sheer number of algorithms ironically makes it challenging to capture emerging trends and to identify the fundamental concepts underlying them. In this paper, we survey and summarize diverse compression algorithms, including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. We not only summarize the overall trend of these compression algorithms but also select representative algorithms and analyze them in depth. We discuss the value of each category of compression algorithms and the desired properties of low-cost compression algorithms, which have become especially important with the emergence of LLMs. Finally, we introduce promising future research topics based on our survey results.

Overview

  • The paper provides a detailed overview of algorithms designed to compress language models effectively.

  • It discusses various compression techniques including pruning, quantization, knowledge distillation, and others, evaluating their performance and impact on model accuracy.

  • SparseGPT, OPTQ, and Low-Rank Adaptation (LoRA) are highlighted as notable algorithms for pruning, quantization, and fine-tuning, respectively.

  • Successful compression algorithms should directly incorporate task-specific objective functions and utilize an iterative compression process.

  • The paper suggests that combining different compression techniques could lead to significant compression rates for LLMs and facilitate wider accessibility.

Overview

The landscape of language model (LM) compression is vast, with an array of algorithms vying to reduce the size and computational demands of these models without compromising their accuracy. This paper presents a comprehensive overview of such algorithms, including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. The analysis explores the intricacies of each approach, evaluates their performance, and compares their effectiveness. Distinguishing between high-cost and low-cost approaches, the paper also underscores the critical attributes that successful LM compression algorithms should possess.

Representative Compression Algorithms

Among the several algorithms surveyed, a few stand out for their contribution to the field. SparseGPT makes significant strides in pruning methodology, successfully handling LLMs and extending its pruning technique to semi-structured sparsity patterns such as 2:4 sparsity. The algorithm selects and updates weights using strategies derived from optimal brain surgeon (OBS) and notably curtails the computational cost of the required Hessian inversion.
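
To make the 2:4 pattern concrete, the sketch below zeroes the two smallest-magnitude weights in every group of four along the input dimension. It uses simple magnitude-based selection purely for illustration; SparseGPT itself chooses and updates weights with Hessian-based, OBS-style saliency, and the function name prune_2_4 is hypothetical.

```python
import numpy as np

def prune_2_4(weight: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude entries in every group of 4 along
    the input dimension (the 2:4 semi-structured sparsity pattern)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "input dimension must be a multiple of 4"
    groups = weight.copy().reshape(rows, cols // 4, 4)
    # indices of the 2 smallest |w| within each group of 4
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

W = np.random.randn(8, 16)
W_24 = prune_2_4(W)
assert ((W_24.reshape(8, -1, 4) != 0).sum(axis=-1) <= 2).all()
```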

In quantization, OPTQ emerges as a potent tool for compressing the colossal parameter matrices of LLMs. The key to OPTQ's success is that it goes beyond naive round-to-nearest quantization: after each weight is quantized, the remaining unquantized weights are adjusted to compensate for the induced error, mitigating the loss in precision. Its strength is further bolstered by subsequent works that refine its approach to minimize accuracy loss, especially when dealing with activations.
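
The sketch below contrasts plain round-to-nearest quantization with a greatly simplified version of this compensation step for a single weight row, using a damped Hessian proxy built from toy calibration data. It is an unblocked approximation for illustration only; the actual OPTQ/GPTQ algorithm reuses a Cholesky factorization and processes columns in blocks, and all names and shapes here are assumptions.

```python
import numpy as np

def rtn_quantize(w, scale):
    """Symmetric round-to-nearest onto a 4-bit grid."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def quantize_with_compensation(w, H_inv, scale):
    """Quantize one weight row column by column; after each column,
    spread its quantization error onto the not-yet-quantized columns.
    A simplified, unblocked stand-in for the OPTQ/GPTQ update rule."""
    w = w.copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = rtn_quantize(w[j], scale)
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]   # error compensation
    return q

# Toy calibration data and one weight row (shapes chosen for illustration).
X = np.random.randn(128, 16)                          # calibration activations
H_inv = np.linalg.inv(X.T @ X + 1e-2 * np.eye(16))    # damped Hessian proxy
w = np.random.randn(16)
scale = np.abs(w).max() / 7
q_naive = rtn_quantize(w, scale)
q_comp = quantize_with_compensation(w, H_inv, scale)
```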

Low-Rank Adaptation (LoRA) is identified as a pivotal method for fine-tuning LMs while updating only a small number of parameters, thereby reducing the memory overhead traditionally associated with fine-tuning large models. By freezing the pre-trained weights and training only a pair of low-rank matrices added to them, LoRA sharply reduces the gradients and optimizer states that must be stored, marking it as essential for adapting LMs at low cost.
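
A minimal sketch of the LoRA parameterization for a plain linear layer is shown below: the pre-trained weight W stays frozen, and only the low-rank factors A and B (scaled by alpha/r) would receive gradients during fine-tuning. The class and argument names are illustrative, not from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A."""
    def __init__(self, W: np.ndarray, rank: int = 8, alpha: float = 16.0):
        d_out, d_in = W.shape
        self.W = W                                    # frozen pre-trained weight
        self.A = np.random.randn(rank, d_in) * 0.01   # trainable, small init
        self.B = np.zeros((d_out, rank))              # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # y = x W^T + (alpha/r) * x A^T B^T ; only A and B would be trained
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(np.random.randn(32, 64), rank=4)
y = layer.forward(np.random.randn(10, 64))   # shape (10, 32)
```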

Desired Properties

The paper highlights two critical properties that successful low-cost LM compression algorithms must possess. First, direct incorporation of task-specific objective functions is vital; proxy objectives such as local layer-wise reconstruction error can lead to suboptimal results. Second, an iterative compression process proves advantageous in mitigating errors at each iteration, thereby preserving the knowledge acquired during pre-training.
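
The toy example below illustrates both properties on a small linear model: the pruning schedule is iterative, and the recovery steps follow the gradient of the task loss itself rather than a layer-wise reconstruction proxy. It is a self-contained sketch under these assumptions, not an algorithm from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
w_true = rng.normal(size=32) * (rng.random(32) < 0.3)   # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=256)

w = rng.normal(size=32) * 0.1
mask = np.ones(32)

def task_loss_grad(w):
    # gradient of the task objective (mean squared error),
    # not a layer-wise reconstruction proxy
    return 2 * X.T @ (X @ w - y) / len(y)

# iterative schedule: prune a little, then recover with task-loss steps
for target_sparsity in (0.25, 0.5, 0.75):
    k = int(target_sparsity * w.size)
    mask[np.argsort(np.abs(w))[:k]] = 0.0     # drop the smallest weights so far
    w *= mask
    for _ in range(200):                      # brief recovery phase
        w -= 0.05 * task_loss_grad(w) * mask  # only surviving weights update

print("final sparsity:", 1 - mask.mean(), "loss:", np.mean((X @ w - y) ** 2))
```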

Future Research

Looking ahead, several promising research directions are identified. The quest for efficient iterative algorithms that further enhance the accuracy of compressed models remains critical, especially for LLMs, where traditional retraining is resource-prohibitive. Effective strategies for directly optimizing the target objective function, quantizing the activations of LLMs, and unifying diverse compression algorithms pave the way for future innovation. The fusion of parameter-efficient fine-tuning (PEFT) with traditional high-cost algorithms holds particular promise for reducing the cost of fine-tuning while maintaining accuracy.

Conclusion

The survey concludes that the amalgamation of various compression techniques could lead to unprecedented compression rates for language models, particularly for the increasingly relevant LLMs. The findings and discussions encapsulated in this paper aim to steer future developments in the field, promoting both cost-effective and performance-optimized compression avenues. The ultimate aim is to democratize access to advanced AI capabilities by making LLMs more resource-efficient and, thus, more widely deployable.
