- The paper extends standard variational dropout by enabling unbounded dropout rates to effectively prune redundant weights in deep networks.
- It introduces an additive noise reparameterization and a tight KL-divergence approximation to reduce gradient variance and achieve high sparsity.
- Empirical results show up to a 280x reduction in parameters with minimal accuracy loss, enhancing model efficiency and generalization.
Variational Dropout Sparsifies Deep Neural Networks
The paper "Variational Dropout Sparsifies Deep Neural Networks" by Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov presents a significant extension to the variational dropout method. Variational Dropout was initially proposed as a Bayesian interpretation of dropout in neural networks, providing a framework within which individual dropout rates could be tuned. In this work, the authors introduced Sparse Variational Dropout (Sparse VD) and demonstrated its ability to lead to highly sparse network structures.
Methodology and Key Contributions
The primary contributions of this paper can be categorized into methodological advancements and empirical validations:
- Extension of Variational Dropout: The authors extend variational dropout to accommodate unbounded per-weight dropout rates, which the original formulation effectively restricts. This matters because the Gaussian noise variance α corresponds to a binary dropout rate p via α = p/(1−p), so α→∞ means p→1: the weight is essentially always dropped and can be pruned from the network (the relations are written out after this list).
- Additive Noise Reparameterization: They propose an additive noise reparameterization to reduce the variance of the stochastic gradient estimator. The standard multiplicative parameterization wij = θij⋅ξij becomes unstable for large dropout rates because the gradient noise on θij grows with α. By switching to the additive form wij = θij + σij⋅ϵij, which encodes exactly the same posterior distribution, they decouple the injected noise from θij and significantly reduce the noise-induced variance of its gradients (a minimal code sketch of this parameterization appears after this list).
- KL-Divergence Approximation: The paper presents a new approximation of the Kullback-Leibler (KL) divergence term in the variational objective that remains tight over the full range of α, whereas the approximation used in the original variational dropout is only accurate for bounded dropout rates. This is essential for correctly optimizing dropout rates once the bound on α is removed (the code sketch after this list includes this approximation).
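For concreteness, these relations can be written out explicitly; the following restates the standard Gaussian dropout identities in the notation used above, rather than introducing anything beyond the paper:

$$
w_{ij} = \theta_{ij}\,\xi_{ij},\;\; \xi_{ij}\sim\mathcal{N}(1,\alpha_{ij})
\quad\Longleftrightarrow\quad
w_{ij} = \theta_{ij} + \sigma_{ij}\,\epsilon_{ij},\;\; \epsilon_{ij}\sim\mathcal{N}(0,1),\;\; \sigma_{ij}^2 = \alpha_{ij}\,\theta_{ij}^2
$$

$$
\alpha = \frac{p}{1-p} \quad\Longleftrightarrow\quad p = \frac{\alpha}{1+\alpha},
\qquad\text{so } \alpha\to\infty \;\Rightarrow\; p\to 1 .
$$

In both cases the approximate posterior over each weight is N(θij, αij⋅θij²); only the way the noise enters the gradients of θij changes, which is what lowers the estimator variance.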
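The sketch below is a minimal PyTorch-style rendering of these ideas for a fully-connected layer. It is an illustration written for this summary, not the authors' reference implementation; the layer names, the initialization, and the pruning threshold are assumptions, and the constants in the KL approximation are the fitted values commonly quoted for the paper's formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseVDLinear(nn.Module):
    """Fully-connected layer with per-weight variational dropout (Sparse VD sketch)."""

    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(0.02 * torch.randn(out_features, in_features))
        # Optimize log(sigma^2) directly; log(alpha) = log(sigma^2) - log(theta^2).
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = threshold  # prune weights whose log(alpha) exceeds this value

    @property
    def log_alpha(self):
        log_theta2 = 2.0 * torch.log(torch.abs(self.theta) + 1e-8)
        return torch.clamp(self.log_sigma2 - log_theta2, -10.0, 10.0)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample the pre-activations, whose mean and
            # variance follow from w_ij ~ N(theta_ij, sigma_ij^2).
            mean = F.linear(x, self.theta, self.bias)
            var = F.linear(x * x, torch.exp(self.log_sigma2)) + 1e-8
            return mean + var.sqrt() * torch.randn_like(mean)
        # Inference: zero out weights whose dropout rate crossed the threshold.
        mask = (self.log_alpha < self.threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # Approximation of -KL(q(w) || p(w)) described above, negated into a penalty;
        # note that log(1 + 1/alpha) = softplus(-log(alpha)).
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

A training step would add the sum of kl() over all such layers to the usual data loss, typically with a scaling or warm-up factor, following the stochastic variational inference objective; at test time, weights whose learned log α exceeds the threshold are simply zeroed out.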
Empirical Results
The experimental evaluation is thorough, spanning several architectures and datasets, including LeNet variants on MNIST and VGG-like networks on CIFAR-10/100. Key observations from the experiments include:
- High Sparsity: Sparse VD leads to models with extremely high sparsity levels. The authors report reducing the number of parameters by up to 280 times on LeNet architectures (MNIST) and by up to 68 times on VGG-like networks (CIFAR-10/100), with minimal degradation in accuracy (a short sketch after this list shows how such compression ratios can be computed from the learned dropout rates).
- Accuracy Maintenance: Despite the aggressive sparsification, the models suffer only a negligible drop in accuracy. This indicates that a significant portion of the parameters in the originally dense networks is redundant.
- Generalization Properties: The research also investigates the robustness of Sparse VD to overfitting on data with random labels. Unlike binary dropout, which does not prevent the network from memorizing randomly labeled data, Sparse VD drives the dropout rates up and prunes the weights rather than fitting the noise.
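As a small illustration of how such compression factors can be measured, the hypothetical helper below counts the weights whose learned log α stays below a pruning threshold, using the SparseVDLinear sketch from earlier; the threshold value of 3.0 is an assumption of this summary.

```python
import torch

def compression_ratio(model: torch.nn.Module, threshold: float = 3.0) -> float:
    """Ratio of total weights to weights that survive pruning by log(alpha)."""
    total, kept = 0, 0
    for module in model.modules():
        if hasattr(module, "log_alpha"):
            log_alpha = module.log_alpha
            total += log_alpha.numel()
            kept += (log_alpha < threshold).sum().item()
    # A ratio of 68 means roughly 1.5% of the weights remain after pruning.
    return total / max(kept, 1)
```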
Implications and Future Directions
The implications of Sparse VD span both theoretical and practical contexts:
- Model Compression: With the ability to prune a significant number of weights without sacrificing performance, Sparse VD can lead to highly efficient models in terms of both computation and memory. This efficiency is of particular importance for deploying deep learning models on edge devices with limited resources.
- Understanding DNN Generalization: The empirical results highlighting the method's ability to prevent overfitting on randomly labeled data add a new dimension to understanding the generalization properties of deep neural networks. Sparse VD implicitly penalizes memorization through variational inference, which could open pathways for further theoretical work exploring these regularization effects.
- Integration with Other Advances: Sparse VD can be combined with other network compression techniques like quantization and Huffman coding to further enhance compression ratios. Additionally, exploring structured sparsity within this framework could lead to significant computational acceleration, extending its utility beyond mere memory savings.
Conclusion
The introduction of Sparse Variational Dropout represents a notable advance in the field of neural network regularization and sparsification. By addressing the limitations of traditional variational dropout in handling high dropout rates and proposing effective variance reduction techniques, the authors provided a practical and theoretically sound method to achieve extreme sparsification. Future research directions include exploring its integration with other model compression techniques, further theoretical analysis of its generalization properties, and extending it to induce structured sparsity for practical speedups. This work stands as an essential step towards more efficient and scalable deep learning models.