Rethinking the Value of Network Pruning (1810.05270v2)

Published 11 Oct 2018 in cs.LG, cs.CV, and stat.ML

Abstract: Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization.

Summary

The paper reveals that training pruned architectures from scratch often matches or exceeds the performance of fine-tuned pruned models across diverse datasets and networks.
It challenges the belief that over-parameterized models are necessary, highlighting the importance of architecture design over inherited weights.
The study contrasts with the Lottery Ticket Hypothesis by demonstrating that optimal learning rates diminish the advantage of specific weight initializations.

Insightful Overview of "Rethinking the Value of Network Pruning"

The paper "Rethinking the Value of Network Pruning," by Zhuang Liu et al., addresses the prevalent technique of network pruning, primarily used to reduce the inference cost of over-parameterized deep neural networks. The authors conduct an extensive empirical evaluation to analyze the necessity and effectiveness of the conventional three-stage pruning pipeline: training a large model, pruning redundant weights based on a specific criterion, and fine-tuning the pruned model to regain any lost performance.
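The three stages described above, and the from-scratch alternative the paper compares against, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the L1-norm ranking is one example of a structured pruning criterion, the "trained" weights are random stand-ins, and the helper names (`l1_norm`, `prune_filters`, `random_layer`) are hypothetical.

```python
import random

def l1_norm(filter_weights):
    """Sum of absolute values of one filter's weights."""
    return sum(abs(w) for w in filter_weights)

def prune_filters(layer, keep_ratio):
    """Keep the filters with the largest L1 norm.

    Returns (kept_indices, pruned_layer). Whole filters are removed,
    so the pruned layer is a smaller dense layer (structured pruning).
    """
    n_keep = max(1, int(len(layer) * keep_ratio))
    ranked = sorted(range(len(layer)), key=lambda i: l1_norm(layer[i]), reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [layer[i] for i in kept]

def random_layer(n_filters, filter_size, rng):
    """Fresh randomly initialized layer of the given shape."""
    return [[rng.gauss(0.0, 0.1) for _ in range(filter_size)] for _ in range(n_filters)]

rng = random.Random(0)

# Stage 1 (stand-in): pretend these are the trained weights of a large layer.
large_layer = random_layer(n_filters=8, filter_size=9, rng=rng)

# Stage 2: prune half of the filters by L1 norm.
kept, inherited = prune_filters(large_layer, keep_ratio=0.5)

# Stage 3 would fine-tune `inherited`. The alternative keeps only the
# pruned shape and starts training from fresh random weights.
scratch = random_layer(n_filters=len(inherited), filter_size=9, rng=rng)

print("kept filter indices:", kept)
print("pruned shape:", (len(inherited), len(inherited[0])))
```

Both `inherited` and `scratch` have the same shape; they differ only in where the starting weights come from, which is the variable the paper isolates.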

Key Findings and Observations

Effectiveness of Training from Scratch: For the state-of-the-art structured pruning methods examined, training the pruned architecture from scratch achieves comparable or superior performance relative to fine-tuning the pruned model. The result holds across multiple architectures (VGG, ResNet, DenseNet), datasets (CIFAR-10, CIFAR-100, ImageNet), and tasks.
Implications on Over-parameterization: The results indicate that training an over-parameterized model is often unnecessary for deriving an efficient final model. This challenges the traditional belief that an initial large model is essential for effectively pruning and retaining high performance.
Architecture vs. Weights: The paper further reveals that the pruned architecture itself, rather than the inherited "important" weights, plays a crucial role in the final model's efficiency. This suggests that the true value of structured pruning methods may lie in implicit architecture search rather than weight selection.
Comparison with the Lottery Ticket Hypothesis: The research contrasts its findings with the "Lottery Ticket Hypothesis," which posits that certain subnetworks, when initialized correctly, can achieve comparable performance to larger models. The authors find that given an optimal learning rate, the "winning ticket" initialization does not offer improvements over random initialization—challenging the necessity of specific initializations for powerful subnetworks.
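The two initializations being compared in the Lottery Ticket discussion can be made concrete with a small sketch. It assumes an unstructured magnitude-pruning mask and uses a random perturbation as a stand-in for training; the function names and the 60% sparsity are illustrative choices, not values from the paper.

```python
import random

def magnitude_mask(trained, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of
    smallest-magnitude trained weights."""
    n_prune = int(len(trained) * sparsity)
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(trained))]

def apply_mask(weights, mask):
    """Zero out the masked positions of a weight vector."""
    return [w * m for w, m in zip(weights, mask)]

rng = random.Random(0)
n = 10

# Weights at initialization, before any training.
original_init = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Stand-in for training: perturb the initial weights (no real optimizer).
trained = [w + rng.gauss(0.0, 0.5) for w in original_init]

# The mask is derived from the trained weights.
mask = magnitude_mask(trained, sparsity=0.6)

# "Winning ticket": surviving weights are reset to their original initial values.
winning_ticket = apply_mask(original_init, mask)

# Random re-initialization: same mask, freshly sampled weights.
random_reinit = apply_mask([rng.gauss(0.0, 1.0) for _ in range(n)], mask)

print("surviving weights:", sum(mask), "of", n)
```

Both starting points share the same sparsity pattern; the comparison in the paper asks whether the specific values in `winning_ticket` help once the learning rate is tuned.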

Implications of the Research

Practical Implications:

The paper advocates for more efficient training practices, specifically highlighting the advantages of directly training smaller, pruned models from scratch. This approach not only simplifies the training process but also conserves computational resources.
For practitioners, when the target architecture is predefined, this removes the need for the pruning and fine-tuning stages: the target network can be trained directly.
By showing that predefined pruned architectures can be trained effectively from scratch, the research supports skipping the large-model training stage, which can shorten the path to deployment and lower computational cost.
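A rough back-of-the-envelope comparison shows where the savings could come from. The sketch below treats training cost as FLOPs per epoch times epochs and ignores the cost of the pruning step itself; all numbers and function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def training_cost(flops_per_epoch, epochs):
    """Cost model: compute scales linearly with per-epoch FLOPs and epochs."""
    return flops_per_epoch * epochs

def pipeline_cost(large_flops, small_flops, train_epochs, finetune_epochs):
    """Three-stage pipeline: train the large model, then fine-tune the pruned one."""
    return (training_cost(large_flops, train_epochs)
            + training_cost(small_flops, finetune_epochs))

def budget_matched_epochs(large_flops, small_flops, large_epochs):
    """Epochs the small model can train for at the large model's training budget."""
    return large_flops * large_epochs / small_flops

# Hypothetical relative FLOPs per epoch: the pruned model is half the cost.
large, small = 100.0, 50.0

three_stage = pipeline_cost(large, small, train_epochs=160, finetune_epochs=40)
scratch = training_cost(small, epochs=160)

print("three-stage pipeline:", three_stage)    # 100*160 + 50*40 = 18000.0
print("small model from scratch:", scratch)    # 50*160 = 8000.0
print("budget-matched epochs:", budget_matched_epochs(large, small, 160))  # 320.0
```

Under these assumed numbers, training the small model for the same number of epochs costs less than half of the full pipeline, and the small model could train for twice as many epochs before matching the large model's training budget alone.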

Theoretical Implications:

The results prompt a re-evaluation of the theoretical underpinnings of network pruning and model over-parameterization. They underscore the need to reconsider assumptions about the importance of initial model size and inherited weights.
The findings draw a connection to neural architecture search (NAS), suggesting that structured pruning methods operate more as architecture optimizers than as weight-selection mechanisms.

Future Directions in AI

Enhanced Architecture Search Methods: The realization that network pruning can function as an architecture search method opens avenues for developing more sophisticated and targeted NAS algorithms that inherently prune networks during the search phase, balancing efficiency and performance.
Generalizable Design Patterns: Observations from successful pruned architectures can inform the design of new neural architectures, leading to innovations that embody the principles of efficient weight allocation and layer utility.
Extending Beyond Classification: Given the promising results on standard image classification tasks, future research could explore the implications of these findings in more complex tasks like object detection, natural language processing, and reinforcement learning, to assess the generalizability of the conclusions drawn.
Alternative Pruning and Training Strategies: Researchers might investigate hybrid approaches that blend elements of structured pruning and architecture search with novel training strategies. This could enhance performance while maintaining computational efficiency across diverse applications.

In conclusion, Liu et al.'s work prompts a substantial shift in how the field approaches network pruning. By challenging established paradigms and introducing evidence that pruned models trained from scratch can perform equivalently or better, this research paves the way for more efficient and streamlined development of deep learning models. As the AI landscape progresses, adopting these insights could result in significant advancements in model training and deployment efficiency, making AI more accessible and sustainable.
