Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity (2310.05175v4)
Abstract: LLMs, renowned for their remarkable performance across diverse domains, present a challenge for practical deployment due to their colossal model size. In response, efforts have been directed toward applying traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trend in vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio OWL assigns to each layer is proportional to the outlier ratio observed within that layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT over various benchmarks, demonstrates the distinct advantages of OWL over previous methods. For instance, at a high sparsity level of 70%, OWL surpasses the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity, respectively, while delivering a 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Code is available at https://github.com/luuyin/OWL.
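To make the allocation rule concrete, below is a minimal, illustrative NumPy sketch of outlier-weighted layerwise sparsity. It is not the authors' released implementation (that lives in the linked repository): the Wanda-style outlier score |W_ij| * ||X_j||_2, the threshold multiplier `M`, the offset bound `lam`, and the assumption that outlier-heavy layers are pruned less aggressively (so that outliers survive) are choices made here purely for exposition.

```python
# Illustrative sketch of outlier-weighted layerwise sparsity allocation.
# NOT the official OWL code (see https://github.com/luuyin/OWL); the outlier
# score |W_ij| * ||X_j||_2, the threshold multiplier M, and the offset bound
# lam are assumptions made here for exposition.

import numpy as np

def layer_outlier_ratio(W, act_norm, M=5.0):
    """Fraction of weights whose outlier score exceeds M times the layer-mean score."""
    score = np.abs(W) * act_norm[None, :]          # |W_ij| * ||X_j||_2
    return float((score > M * score.mean()).mean())

def outlier_weighted_sparsities(outlier_ratios, target=0.7, lam=0.08):
    """Shift each layer's sparsity away from `target` by at most `lam`:
    layers with more outliers are pruned less; the mean stays at `target`."""
    d = np.asarray(outlier_ratios, dtype=float)
    centered = d - d.mean()
    scale = np.abs(centered).max()
    offsets = np.zeros_like(centered) if scale == 0 else -lam * centered / scale
    return target + offsets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "layers": random weight matrices plus per-input-channel activation norms.
    layers = [(rng.standard_normal((256, 512)), np.abs(rng.standard_normal(512)) + 0.1)
              for _ in range(4)]
    ratios = [layer_outlier_ratio(W, a) for W, a in layers]
    print("outlier ratios :", np.round(ratios, 4))
    print("layer sparsity :", np.round(outlier_weighted_sparsities(ratios), 4))
```

On the toy example, the returned per-layer sparsities average exactly the 0.7 global target, with more outlier-heavy layers assigned sparsity below the target and less outlier-heavy layers above it.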
- Leveraging redundancy in attention with reuse transformers. arXiv preprint arXiv:2110.06821, 2021.
- Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems (NeurIPS), 2022.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- On random graphs I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
- Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning (ICML), pp. 2943–2952, 2020.
- The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
- SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning (ICML), 2023.
- The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
- Sparse GPU kernels for deep learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14. IEEE, 2020.
- Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1135–1143, 2015.
- Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299. IEEE, 1993.
- The emergence of essential sparsity in large pre-trained models: The weights that matter. arXiv preprint arXiv:2306.03805, 2023.
- Steven A Janowsky. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600, 1989.
- The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022.
- Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), pp. 598–605, 1989.
- Layer-adaptive sparsity for the magnitude-based pruning. arXiv preprint arXiv:2010.07611, 2020.
- Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations (ICLR), 2019.
- AWQ: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
- Towards optimal structured cnn pruning via generative adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2790–2799, 2019.
- Sparse training via boosting pruning plasticity with neuroregeneration. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. arXiv preprint arXiv:2202.02643, 2022.
- Estimating the carbon footprint of BLOOM, a 176B parameter language model. arXiv preprint arXiv:2211.02001, 2022.
- LLM-Pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Studying the plasticity in deep convolutional neural networks using random pruning. Machine Vision and Applications, 30(2):203–216, 2019.
- A topological insight into restricted Boltzmann machines. Machine Learning, 104(2):243–270, 2016.
- Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9:1–12, 2018.
- Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems (NeurIPS), pp. 107–115, 1989.
- Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
- Movement pruning: Adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683, 2020.
- Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
- A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations (ICLR), 2020.
- Rethinking the value of transformer components. arXiv preprint arXiv:2011.03803, 2020.
- Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
- Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027, 2017.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (ICML), pp. 38087–38099. PMLR, 2023.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- To prune, or not to prune: exploring the efficacy of pruning for model compression. In International Conference on Learning Representations Workshop (ICLRW), 2017.