How Sparse Can We Prune A Deep Network: A Fundamental Limit Viewpoint (2306.05857v4)

Published 9 Jun 2023 in stat.ML and cs.LG

Abstract: Network pruning is a commonly used measure to alleviate the storage and computational burden of deep neural networks. However, the fundamental limit of network pruning is still lacking. To close the gap, in this work we'll take a first-principles approach, i.e. we'll directly impose the sparsity constraint on the loss function and leverage the framework of statistical dimension in convex geometry, thus enabling us to characterize the sharp phase transition point, which can be regarded as the fundamental limit of the pruning ratio. Through this limit, we're able to identify two key factors that determine the pruning ratio limit, namely, weight magnitude and network sharpness. Generally speaking, the flatter the loss landscape or the smaller the weight magnitude, the smaller pruning ratio. Moreover, we provide efficient countermeasures to address the challenges in the computation of the pruning limit, which mainly involves the accurate spectrum estimation of a large-scale and non-positive Hessian matrix. Moreover, through the lens of the pruning ratio threshold, we can also provide rigorous interpretations on several heuristics in existing pruning algorithms. Extensive experiments are performed which demonstrate that our theoretical pruning ratio threshold coincides very well with the experiments. All codes are available at: https://github.com/QiaozheZhang/Global-One-shot-Pruning

Summary

The paper frames network pruning as a set intersection problem, using statistical dimension to set necessary and sufficient sparsity conditions.
The paper shows that smaller weight magnitudes and flatter loss landscapes enable more aggressive pruning without performance loss.
The paper introduces an efficient spectrum estimation method for non-positive Hessian matrices, validated across multiple architectures.

Overview of the Paper "How Sparse Can We Prune A Deep Network: A Fundamental Limit Viewpoint"

The paper "How Sparse Can We Prune A Deep Network: A Fundamental Limit Viewpoint" addresses a crucial question in the field of deep learning: How aggressively can we prune a deep neural network without compromising its performance? Network pruning is a technique that reduces the number of parameters in a model, thus alleviating computational and storage burdens. This work provides a theoretical foundation for determining the limits of network sparsity through the lens of convex geometry and statistical dimension.

Key Contributions and Theoretical Insights

Pruning Limit as a Set Intersection Problem: The authors propose framing the network pruning problem as a set intersection challenge. Specifically, they utilize the statistical dimension—a concept drawn from high-dimensional convex geometry—to characterize the intersection of a sparsity-constrained set with a loss sublevel set. This allows them to precisely determine the necessary and sufficient conditions for network sparsity without sacrificing performance.
Role of Weight Magnitude and Loss Landscape Flatness: Two critical factors identified by the research are weight magnitude and network flatness (trace of the Hessian matrix). The findings suggest that networks with smaller weight magnitudes and flatter loss landscapes can tolerate more aggressive pruning. This insight provides a theoretical underpinning for practices in existing pruning algorithms, which often hinge on parameter magnitude as a criterion for pruning.
Efficient Spectrum Estimation: The paper introduces an improved method for estimating the spectrum of large-scale and non-positive Hessian matrices, which is crucial for understanding the loss sublevel set. This enhancement is a significant technical contribution, making the theoretical framework more practical for large models common in deep learning.
Experimental Validation: Comprehensive experiments are conducted across multiple architectures and datasets, showing strong agreement between theoretical predictions and empirical results. The pruning ratio thresholds predicted by the theory closely match the experimentally determined values, validating the framework's applicability.

Implications and Future Directions

The findings of this paper have important implications for both theory and practice in neural network design:

Theoretical Implications:

The work formalizes the relationship between model capacity, as measured by parameter count, and the geometric properties of the loss landscape. This could inform future explorations into understanding DNN capacity more broadly and might drive further theoretical inquiries into the role of flatness in generalization.

Practical Implications:

For practitioners, the insights about weight magnitude and flatness can guide the development of more efficient pruning algorithms. These insights encourage a focus on maintaining flatness during training to enable more aggressive pruning.

Future Research Directions:

The paper raises questions about how these principles might guide network design before training, potentially allowing for preemptive architecture compression without post-training pruning. Additionally, exploring the relationship between flatness optimization and network robustness could yield new strategies for enhancing model performance.

Overall, this paper offers a rigorous and nuanced perspective on network pruning, backed by robust theoretical insights and empirical validation. It paves the way for more informed strategies in managing model complexity while maintaining performance excellence in deep learning applications.