Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity (2310.05175v4)
Abstract: LLMs, renowned for their remarkable performance across diverse domains, present a challenge for practical deployment due to their colossal model size. In response, efforts have been directed toward applying traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trend in vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio OWL assigns to each layer is proportional to the outlier ratio observed within that layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT over various benchmarks, demonstrates the distinct advantages of OWL over previous methods. For instance, at a high sparsity level of 70%, OWL surpasses the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity, respectively, while delivering a 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Code is available at https://github.com/luuyin/OWL.
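To make the allocation rule concrete, below is a minimal, illustrative NumPy sketch of outlier-weighted layerwise sparsity. It is not the authors' released implementation (that lives in the linked repository): the Wanda-style outlier score |W_ij| * ||X_j||_2, the threshold multiplier `M`, the offset bound `lam`, and the assumption that outlier-heavy layers are pruned less aggressively (so that outliers survive) are choices made here purely for exposition.

```python
# Illustrative sketch of outlier-weighted layerwise sparsity allocation.
# NOT the official OWL code (see https://github.com/luuyin/OWL); the outlier
# score |W_ij| * ||X_j||_2, the threshold multiplier M, and the offset bound
# lam are assumptions made here for exposition.

import numpy as np

def layer_outlier_ratio(W, act_norm, M=5.0):
    """Fraction of weights whose outlier score exceeds M times the layer-mean score."""
    score = np.abs(W) * act_norm[None, :]          # |W_ij| * ||X_j||_2
    return float((score > M * score.mean()).mean())

def outlier_weighted_sparsities(outlier_ratios, target=0.7, lam=0.08):
    """Shift each layer's sparsity away from `target` by at most `lam`:
    layers with more outliers are pruned less; the mean stays at `target`."""
    d = np.asarray(outlier_ratios, dtype=float)
    centered = d - d.mean()
    scale = np.abs(centered).max()
    offsets = np.zeros_like(centered) if scale == 0 else -lam * centered / scale
    return target + offsets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "layers": random weight matrices plus per-input-channel activation norms.
    layers = [(rng.standard_normal((256, 512)), np.abs(rng.standard_normal(512)) + 0.1)
              for _ in range(4)]
    ratios = [layer_outlier_ratio(W, a) for W, a in layers]
    print("outlier ratios :", np.round(ratios, 4))
    print("layer sparsity :", np.round(outlier_weighted_sparsities(ratios), 4))
```

On the toy example, the returned per-layer sparsities average exactly the 0.7 global target, with more outlier-heavy layers assigned sparsity below the target and less outlier-heavy layers above it.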
- Leveraging redundancy in attention with reuse transformers. arXiv preprint arXiv:2110.06821, 2021.
- Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems (NeurIPS), 2022.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- On random graphs I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
- Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning (ICML), pp. 2943–2952, 2020.
- The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
- SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning (ICML), 2023.
- The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
- Sparse GPU kernels for deep learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14. IEEE, 2020.
- Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1135–1143, 2015.
- Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299. IEEE, 1993.
- The emergence of essential sparsity in large pre-trained models: The weights that matter. arXiv preprint arXiv:2306.03805, 2023.
- Steven A Janowsky. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600, 1989.
- The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022.
- Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), pp. 598–605, 1989.
- Layer-adaptive sparsity for the magnitude-based pruning. arXiv preprint arXiv:2010.07611, 2020.
- Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations (ICLR), 2019.
- AWQ: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
- Towards optimal structured cnn pruning via generative adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2790–2799, 2019.
- Sparse training via boosting pruning plasticity with neuroregeneration. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. arXiv preprint arXiv:2202.02643, 2022.
- Estimating the carbon footprint of BLOOM, a 176B parameter language model. arXiv preprint arXiv:2211.02001, 2022.
- LLM-Pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Studying the plasticity in deep convolutional neural networks using random pruning. Machine Vision and Applications, 30(2):203–216, 2019.
- A topological insight into restricted Boltzmann machines. Machine Learning, 104(2):243–270, 2016.
- Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9:1–12, 2018.
- Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems (NeurIPS), pp. 107–115, 1989.
- Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
- Movement pruning: Adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683, 2020.
- Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
- A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations (ICLR), 2020.
- Rethinking the value of transformer components. arXiv preprint arXiv:2011.03803, 2020.
- Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
- Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027, 2017.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (ICML), pp. 38087–38099. PMLR, 2023.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- To prune, or not to prune: exploring the efficacy of pruning for model compression. In International Conference on Learning Representations Workshop (ICLRW), 2017.