Movement Pruning: Adaptive Sparsity by Fine-Tuning

Published 15 May 2020 in cs.CL and cs.LG | (2005.07683v2)

Abstract: Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

Summary

The paper introduces a novel movement pruning method that adapts weight pruning by tracking changes during fine-tuning.
It leverages gradient-based importance scores and straight-through estimators to overcome non-differentiability challenges.
Experimental results show that the soft variant retains around 95% of the original model's performance even when BERT-base is pruned to 5% of its encoder weights.

Overview of "Movement Pruning: Adaptive Sparsity by Fine-Tuning"

Model sparsity is a pivotal factor in neural network efficiency, especially when large pretrained models are deployed in resource-constrained environments. The paper "Movement Pruning: Adaptive Sparsity by Fine-Tuning" addresses the need for more adaptive pruning techniques in transfer learning with pretrained language models. The authors introduce a pruning method they term "movement pruning," which is particularly effective in high-sparsity regimes.

Movement pruning departs from traditional magnitude pruning by selecting weights according to how they change during fine-tuning rather than by their absolute values, which are largely inherited from pre-training. This shift from a zeroth-order to a first-order criterion lets the pruning decision adapt to the specific fine-tuning task. The method maintains an importance score for each weight, updates the scores with gradients during training, and uses a straight-through estimator to pass gradients through the non-differentiable step function that turns scores into a mask.
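
The score update described above can be illustrated on a single linear unit. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the function name `movement_score_step`, the squared-error loss, and the learning rates are assumptions made for the example. It also assumes that the straight-through estimator treats the binary mask as the identity in the backward pass, so the gradient with respect to a score is approximated by the product of the upstream gradient, the weight, and the input.

```python
def movement_score_step(weights, scores, mask, x, target, lr_w=0.1, lr_s=0.1):
    """One gradient step for a single linear unit a = sum_j (w_j * m_j * x_j)
    with loss L = 0.5 * (a - target) ** 2. All names here are illustrative."""
    a = sum(w * m * xj for w, m, xj in zip(weights, mask, x))
    dL_da = a - target

    # Weight gradient: only unmasked weights (m_j = 1) contribute to the output.
    grad_w = [dL_da * m * xj for m, xj in zip(mask, x)]

    # Score gradient under the assumed straight-through estimator: the mask is
    # treated as the identity in the backward pass, so dL/dS_j ~ dL/da * w_j * x_j.
    grad_s = [dL_da * w * xj for w, xj in zip(weights, x)]

    new_weights = [w - lr_w * g for w, g in zip(weights, grad_w)]
    new_scores = [s - lr_s * g for s, g in zip(scores, grad_s)]
    return new_weights, new_scores


# Two weights of equal magnitude, so magnitude pruning cannot tell them apart.
weights, scores = [0.5, -0.5], [0.0, 0.0]
new_weights, new_scores = movement_score_step(
    weights, scores, mask=[1.0, 1.0], x=[1.0, 1.0], target=2.0
)
```

In this toy step the first weight moves away from zero and its score rises, while the second moves toward zero and its score falls, even though both start with the same magnitude.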

Methodology

The methodological novelty of the paper lies in how movement pruning scores weights: a weight's score grows as the weight moves away from zero over the training iterations and shrinks as it moves toward zero. The approach comes in two variants: hard movement pruning, which deterministically ranks the scores and keeps the weights with the highest ones; and soft movement pruning, which keeps the weights whose scores exceed a threshold and adds a regularization term that controls the sparsity level throughout training.
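
The two variants differ mainly in how scores become a binary mask. The following sketch contrasts them in plain Python; the function names, the example scores, and the sigmoid form of the penalty are assumptions for illustration rather than details stated in this summary.

```python
import math


def hard_mask(scores, keep_fraction):
    """Hard variant: keep a fixed fraction of weights with the highest scores."""
    k = max(1, int(keep_fraction * len(scores)))
    # Rank indices by descending score; ties are broken by index for determinism.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    kept = set(ranked[:k])
    return [1.0 if i in kept else 0.0 for i in range(len(scores))]


def soft_mask(scores, threshold):
    """Soft variant: keep every weight whose score exceeds a threshold."""
    return [1.0 if s > threshold else 0.0 for s in scores]


def sparsity_penalty(scores, strength):
    """Assumed regularizer: strength * sum(sigmoid(score)). Lower scores cost less,
    so minimizing it pushes scores down and more weights below the threshold."""
    return strength * sum(1.0 / (1.0 + math.exp(-s)) for s in scores)


scores = [2.0, -1.0, 0.5, -3.0]
print(hard_mask(scores, keep_fraction=0.5))  # keeps the two highest scores
print(soft_mask(scores, threshold=1.0))      # keeps only scores above 1.0
print(sparsity_penalty(scores, strength=0.1))
```

With the hard variant the sparsity level is fixed by `keep_fraction`; with the soft variant the number of kept weights depends on the scores themselves, which is why a penalty on the scores is needed to steer the sparsity level.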

Experimental Results

The empirical validation of movement pruning demonstrates its effectiveness across a range of established NLP tasks including SQuAD, MNLI, and QQP. The results reveal that movement pruning not only surpasses magnitude pruning in high-sparsity scenarios but also compares favorably with existing advanced pruning techniques like $L_0$ regularization.

For instance, the soft movement pruning method retains around 95% of original model accuracy despite pruning BERT-base to retain merely 5% of encoder weights. Furthermore, when coupled with distillation, movement pruning methods show even lower performance degradation at extreme sparsity, thereby highlighting potential for practical compression applications without significant loss in task performance.

Implications and Future Directions

Movement pruning represents a significant advancement in model compression for transfer learning. It opens avenues for efficiently deploying state-of-the-art NLP models on edge devices, contributing to reduced energy consumption and enhanced privacy by eliminating the need for continual data communication with centralized servers.

From a theoretical standpoint, the paper provides grounds to reconsider weight pruning criteria, advocating for task-adapted pruning mechanisms rather than static pre-trained model reductions. Future developments could explore synergies between movement pruning and structured pruning techniques to enhance both interpretability and computational efficiency. Additionally, examining the integration of hardware-specific optimizations for these highly sparse models could broaden the application scope of movement pruning within industry and research sectors alike.