Learning Sparse Neural Networks through $L_0$ Regularization (1712.01312v2)

Published 4 Dec 2017 in stat.ML and cs.LG

Abstract: We propose a practical method for $L_0$ norm regularization for neural networks: pruning the network during training by encouraging weights to become exactly zero. Such regularization is interesting since (1) it can greatly speed up training and inference, and (2) it can improve generalization. AIC and BIC, well-known model selection criteria, are special cases of $L_0$ regularization. However, since the $L_0$ norm of weights is non-differentiable, we cannot incorporate it directly as a regularization term in the objective function. We propose a solution through the inclusion of a collection of non-negative stochastic gates, which collectively determine which weights to set to zero. We show that, somewhat surprisingly, for certain distributions over the gates, the expected $L_0$ norm of the resulting gated weights is differentiable with respect to the distribution parameters. We further propose the \emph{hard concrete} distribution for the gates, which is obtained by "stretching" a binary concrete distribution and then transforming its samples with a hard-sigmoid. The parameters of the distribution over the gates can then be jointly optimized with the original network parameters. As a result our method allows for straightforward and efficient learning of model structures with stochastic gradient descent and allows for conditional computation in a principled way. We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer.

Citations (1,070)

Summary

  • The paper presents a novel surrogate framework that approximates L0 regularization using stochastic hard concrete gates.
  • It demonstrates efficient gradient-based optimization that yields compact models with minimal loss in accuracy on benchmarks like MNIST and CIFAR.
  • The method significantly reduces model complexity and FLOPs, paving the way for conditional computation and improved generalization.

Learning Sparse Neural Networks Through $L_0$ Regularization

The paper "Learning Sparse Neural Networks through L0L_0 Regularization" by Louizos et al. presents a novel and practical method for enforcing L0L_0 norm regularization in neural networks. This article provides an expert overview of the methodology, empirical results, and implications.

Deep neural networks, while highly flexible and powerful, often suffer from over-parameterization and overfitting. These issues necessitate techniques for model compression and regularization. Traditional methods, such as $L_1$ regularization, shrink parameter values towards zero but fail to achieve actual sparsity where weights are exactly zero. The $L_0$ norm, which counts the number of non-zero elements, is theoretically ideal for enforcing sparsity but is computationally intractable due to its non-differentiable and combinatorial nature.
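
Concretely, the regularized objective targeted by the paper can be written (in notation adapted here, with $h$ the network, $\mathcal{L}$ the loss, and $\lambda$ the regularization strength) as

$$\mathcal{R}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(h(x_i; \theta), y_i\big) + \lambda \|\theta\|_0, \qquad \|\theta\|_0 = \sum_{j=1}^{|\theta|} \mathbb{I}[\theta_j \neq 0],$$

where the indicator function is exactly what makes the penalty non-differentiable and the minimization combinatorial.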

Methodology

Louizos et al. introduce a surrogate approach to $L_0$ regularization that facilitates efficient optimization using gradient-based methods. Their key innovation involves using non-negative stochastic gates to determine which network weights should be set to zero. Specifically, they employ continuous random variables transformed by a hard sigmoid function to mimic binary behavior. Through this mechanism, they derive a differentiable approximation of the $L_0$ norm regularized objective.
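
Schematically, each weight is reparameterized as $\theta_j = \tilde{\theta}_j z_j$ with a stochastic gate $z_j \sim q(z_j \mid \phi_j)$, so the intractable penalty is replaced by its expectation (in notation adapted here),

$$\mathbb{E}_{q}\big[\|\theta\|_0\big] = \sum_{j=1}^{|\theta|} q(z_j \neq 0 \mid \phi_j),$$

which is differentiable with respect to the gate parameters $\phi_j$ and can be minimized jointly with the expected training loss.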

The authors propose the hard concrete distribution for these gates, created by stretching a binary concrete distribution and passing its samples through a hard-sigmoid. This distribution allows for both exact zero values and efficient gradient-based optimization due to its smooth and continuous properties.
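
As a concrete illustration, the following minimal PyTorch sketch (not the authors' implementation; the gate parameter `log_alpha` and the constants `BETA`, `GAMMA`, `ZETA` follow the settings suggested in the paper, while the surrounding code and placeholder values are assumptions) samples hard concrete gates and computes the differentiable expected $L_0$ penalty:

```python
import math
import torch

# Stretch and temperature constants suggested in the paper (assumed defaults here):
# gamma < 0 < 1 < zeta stretches the concrete distribution so the hard-sigmoid
# can produce exact 0s and 1s; beta is the concrete temperature.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1


def sample_hard_concrete(log_alpha: torch.Tensor, training: bool = True) -> torch.Tensor:
    """Sample gates z in [0, 1] from the hard concrete distribution."""
    if training:
        u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
        # Binary concrete sample via the logistic reparameterization trick.
        s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / BETA)
    else:
        # Test-time estimator: use the noise-free gate.
        s = torch.sigmoid(log_alpha)
    # Stretch to (gamma, zeta), then clip with a hard-sigmoid to allow exact zeros.
    return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)


def expected_l0(log_alpha: torch.Tensor) -> torch.Tensor:
    """Differentiable expected L0 penalty: sum of P(gate != 0)."""
    return torch.sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA)).sum()


# Usage sketch: gate a weight matrix and add the penalty to the task loss.
weights = torch.randn(256, 128, requires_grad=True)
log_alpha = torch.full_like(weights, 2.0, requires_grad=True)  # gates start mostly open
z = sample_hard_concrete(log_alpha)
sparse_weights = weights * z
penalty = 1e-4 * expected_l0(log_alpha)  # lambda = 1e-4 is a placeholder value
# total_loss = task_loss(sparse_weights, batch) + penalty
```

Multiplying sampled gates onto the weights during training and using the clipped, noise-free gate at test time is what yields exact zeros while keeping the objective amenable to stochastic gradient descent.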

The primary contributions of this paper are:

  1. A general framework for optimizing $L_0$ regularization with stochastic gradient descent.
  2. The introduction of the hard concrete distribution to achieve sparsity without sacrificing differentiability.
  3. Empirical validation demonstrating the effectiveness of this approach on various tasks.

Key Results

The authors validate their methodology on classic and modern benchmarks: MNIST, CIFAR-10, and CIFAR-100. They evaluate architectures ranging from simple multilayer perceptrons (MLPs) to Wide Residual Networks (WRNs).

For MNIST:

  • The proposed $L_0$ regularization achieves comparable or better test accuracy while significantly reducing the number of non-zero parameters. Specifically, the method compresses an MLP architecture with no loss in performance and yields a LeNet-5 model that substantially reduces parameter count while retaining high accuracy.

For CIFAR-10 and CIFAR-100 using WRNs:

  • The $L_0$ regularized models not only improve accuracy compared to dropout-based counterparts but also show substantial reductions in the expected number of floating point operations (FLOPs) during training. These reductions imply potential speedups in both training and inference.

Implications and Future Developments

The implications of this research are manifold:

  1. Practicality in Training: The ability to introduce sparsity during training, rather than post hoc pruning, opens avenues for more efficient training regimens. Conditional computation, where only necessary parts of the network are computed, can now be more feasibly realized with principled $L_0$ regularization.
  2. Regularization and Generalization: The method helps mitigate overfitting by reducing unnecessary model complexity, thereby improving model generalization. This makes it especially valuable for tasks with limited training data or highly complex models.
  3. Future Research Directions: Potential avenues for future research include exploring the integration of these techniques in more extensive and deeper architectures, investigating automatic hyperparameter tuning of the regularization strength, and examining the effects of combining $L_0$ regularization with other advanced regularization methods.

Conclusion

The paper by Louizos et al. offers a significant advance in the domain of neural network regularization and sparsification. By innovatively reparameterizing the $L_0$ norm regularization problem and introducing the hard concrete distribution, the authors provide an efficient and scalable method for training sparse neural networks. The results exhibit a compelling argument for the adoption of this technique in both research and practical applications due to its potential for model efficiency and improved generalization.
