LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation (2306.11222v2)
Abstract: Transformer models have achieved remarkable results on a variety of natural language tasks, but they are often prohibitively large, requiring massive memory and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximation and pruning while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts of the neurons, while pruning removes the incoherent and non-expressive parts. Pruning enhances the diversity of low-rank approximations, and low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks, and show that it significantly outperforms existing compression methods.
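To make the decomposition concrete, the snippet below is a minimal PyTorch sketch of splitting a single weight matrix W into a rank-r factorization U V plus a sparse residual S. The function name `losparse_init`, the rank `r`, and the keep ratio `density` are illustrative assumptions, not the paper's API; the paper prunes the sparse component iteratively during fine-tuning using importance scores, whereas this sketch uses one-shot magnitude pruning of the SVD residual purely to show the shape of the approximation.

```python
# Minimal low-rank + sparse decomposition sketch (not the paper's exact algorithm).
import torch

def losparse_init(W: torch.Tensor, r: int, density: float):
    """Approximate W (d_out x d_in) as U @ V + S, with rank-r U @ V and sparse S."""
    # Low-rank part: truncated SVD of the weight matrix, singular values split
    # evenly (as square roots) between the two factors.
    U_full, sigma, Vh = torch.linalg.svd(W, full_matrices=False)
    U = U_full[:, :r] * sigma[:r].sqrt()           # (d_out, r)
    V = sigma[:r].sqrt().unsqueeze(1) * Vh[:r, :]  # (r, d_in)

    # Sparse part: the residual, kept only at its largest-magnitude entries.
    # (The paper instead prunes this component gradually during training.)
    residual = W - U @ V
    k = int(density * residual.numel())
    threshold = residual.abs().flatten().kthvalue(residual.numel() - k).values
    S = torch.where(residual.abs() > threshold, residual, torch.zeros_like(residual))
    return U, V, S

# Usage: decompose a 768 x 768 weight and inspect error and parameter count.
W = torch.randn(768, 768)
U, V, S = losparse_init(W, r=32, density=0.05)
approx = U @ V + S
rel_err = ((W - approx).norm() / W.norm()).item()
stored = U.numel() + V.numel() + S.count_nonzero().item()
print(f"relative error: {rel_err:.3f}")
print(f"stored params: {stored} vs {W.numel()}")
```

The low-rank factors capture the directions shared across many neurons, while the sparse residual retains the few individually important weights that a pure low-rank fit would miss.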