Faster CNNs with Direct Sparse Convolutions and Guided Pruning

Published 4 Aug 2016 in cs.CV | (1608.01409v5)

Abstract: Phenomenally successful in practical inference problems, convolutional neural networks (CNN) are widely deployed in mobile devices, data centers, and even supercomputers. The number of parameters needed in CNNs, however, is often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. Nevertheless, the resulting CNNs offer limited benefits. While pruning the fully connected layers reduces a CNN's size considerably, it does not improve inference speed noticeably as the compute heavy parts lie in convolutions. Pruning CNNs in a way that increases inference speed often imposes specific sparsity structures, thus limiting the achievable sparsity levels. We present a method to realize simultaneously size economy and speed improvement while pruning CNNs. Paramount to our success is an efficient general sparse-with-dense matrix multiplication implementation that is applicable to convolution of feature maps with kernels of arbitrary sparsity patterns. Complementing this, we developed a performance model that predicts sweet spots of sparsity levels for different layers and on different computer architectures. Together, these two allow us to demonstrate 3.1--7.3$\times$ convolution speedups over dense convolution in AlexNet, on Intel Atom, Xeon, and Xeon Phi processors, spanning the spectrum from mobile devices to supercomputers. We also open source our project at https://github.com/IntelLabs/SkimCaffe.

Summary

The paper introduces a direct sparse convolution method that reformulates convolutions to enhance arithmetic intensity and computational efficiency.
The performance model estimates speedup from the non-zero density of the kernels, showing that moderate sparsity (around 70%) can already yield significant acceleration, with convolution speedups of up to 7.3× on AlexNet.
Guided Sparsity Learning strategically prunes layers within effective sparsity ranges, ensuring maximum inference speedup without compromising model accuracy.
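
The interplay between the performance model and layer selection can be illustrated with a minimal roofline-style sketch. It is not the paper's model: the overhead factor `alpha`, the hardware figures `PEAK_FLOPS` and `BANDWIDTH`, the layer sizes, and all function names are hypothetical values and helpers chosen for illustration. The sketch assumes sparse time is the larger of a compute term that shrinks with density and a memory term that does not.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    flops: float        # dense arithmetic work of the layer, in FLOPs
    bytes_moved: float  # activation traffic, in bytes (assumed density-independent)


def projected_speedup(layer, density, peak_flops, bandwidth, alpha=3.0):
    """Estimate sparse-over-dense speedup for one layer.

    density is the fraction of non-zero weights (1 - sparsity).
    alpha is an assumed per-FLOP overhead of sparse code relative to dense.
    """
    dense_time = layer.flops / peak_flops
    sparse_compute = alpha * density * layer.flops / peak_flops
    sparse_memory = layer.bytes_moved / bandwidth
    return dense_time / max(sparse_compute, sparse_memory)


def useful_sparsity_range(layer, peak_flops, bandwidth, alpha=3.0):
    """Return (lower, upper) sparsity bounds between which pruning pays off."""
    # Below this sparsity the sparse kernel is slower than the dense one.
    lower = 1.0 - 1.0 / alpha
    # Density at which the compute term falls to the memory floor;
    # pruning beyond the matching sparsity gives no further speedup.
    floor_density = (layer.bytes_moved / bandwidth) / (alpha * layer.flops / peak_flops)
    upper = 1.0 - min(1.0, floor_density)
    return lower, upper


# Hypothetical machine and layers, for illustration only.
PEAK_FLOPS = 1e12
BANDWIDTH = 1e11
layers = [
    Layer("conv3x3", flops=3e8, bytes_moved=2e6),
    Layer("conv1x1", flops=1e7, bytes_moved=2e6),
]

# Keep only layers whose useful sparsity range is non-empty.
worth_pruning = []
for layer in layers:
    lower, upper = useful_sparsity_range(layer, PEAK_FLOPS, BANDWIDTH)
    if upper > lower:
        worth_pruning.append(layer)
```

Under these assumed numbers the low-intensity 1×1 layer hits the memory floor before sparse code breaks even, so only the 3×3 layer is kept as a pruning target, which mirrors the idea of spending pruning effort only where speedup is attainable.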

Faster CNNs with Direct Sparse Convolutions and Guided Pruning: A Summary

This paper introduces methods that speed up Convolutional Neural Network (CNN) inference through direct sparse convolutions and guided pruning. The authors address the large parameter counts of CNNs and the computational cost that comes with them, which is concentrated in the convolution layers that dominate inference time.

Key Contributions

Direct Sparse Convolutions: The authors propose a direct sparse convolution technique as the core advance. It formulates sparse convolution as a sparse-matrix-dense-matrix multiplication without first lowering the input tensor to a matrix, a step that replicates input data and thereby reduces arithmetic intensity. Instead, the multiplication operates on a "virtual" dense matrix that is never materialized, which keeps arithmetic intensity high and improves data reuse, especially when there are many channels.
Performance Modelling: A performance model is developed to predict the attainable speedup and to guide the pruning process. The model uses a roofline analysis to project speedup from the non-zero density of the sparse convolution kernels and the characteristics of a given processor architecture. Notably, the model shows that even moderate sparsity of around 70% can yield substantial speedups with the proposed method.
Guided Sparsity Learning (GSL): GSL is a pruning algorithm that uses the performance model to focus on the layers and sparsity ranges where speedup is attainable. Unlike typical pruning, GSL stops pruning layers that fall outside their effective sparsity range and redirects that effort to layers where the most speedup is achievable.
Empirical Validation: The methods are validated experimentally on AlexNet and GoogLeNet across Intel Atom, Xeon, and Xeon Phi processors, showing speedups of up to 7.3× (for AlexNet on the Atom processor) without compromising model accuracy.
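
The direct sparse convolution idea described above can be sketched in a few lines. This is a simplified illustration, not the authors' optimized kernel: it assumes stride 1, no padding, and a single input image, and the helper names `to_sparse_rows`, `direct_sparse_conv`, and `dense_conv` are hypothetical. Each non-zero weight is stored with an offset into the flattened input, so the lowered matrix is never built.

```python
def to_sparse_rows(weights, H, W):
    """Convert dense filters weights[n][c][r][s] into per-filter lists of
    (offset, value) pairs; offset indexes the flattened (C, H, W) input."""
    rows = []
    for filt in weights:
        row = []
        for c, channel in enumerate(filt):
            for r, kernel_row in enumerate(channel):
                for s, value in enumerate(kernel_row):
                    if value != 0.0:
                        row.append(((c * H + r) * W + s, value))
        rows.append(row)
    return rows


def direct_sparse_conv(x_flat, rows, H, W, R, S):
    """Stride-1, unpadded convolution that touches only non-zero weights."""
    E, F = H - R + 1, W - S + 1  # output height and width
    out = [[0.0] * (E * F) for _ in rows]
    for n, row in enumerate(rows):
        for offset, value in row:
            # One non-zero weight scales a shifted window of the input.
            for i in range(E):
                base = offset + i * W
                for j in range(F):
                    out[n][i * F + j] += value * x_flat[base + j]
    return out


def dense_conv(x_flat, weights, H, W):
    """Straightforward dense reference used to check the sparse result."""
    R, S = len(weights[0][0]), len(weights[0][0][0])
    E, F = H - R + 1, W - S + 1
    out = []
    for filt in weights:
        plane = []
        for i in range(E):
            for j in range(F):
                acc = 0.0
                for c, channel in enumerate(filt):
                    for r in range(R):
                        for s in range(S):
                            acc += channel[r][s] * x_flat[(c * H + i + r) * W + j + s]
                plane.append(acc)
        out.append(plane)
    return out


# Tiny example: 2 input channels of 4x4, two 3x3 filters with 3 non-zeros total.
C, H, W, R, S = 2, 4, 4, 3, 3
x_flat = [float(k) for k in range(C * H * W)]
weights = [[[[0.0] * S for _ in range(R)] for _ in range(C)] for _ in range(2)]
weights[0][0][0][0] = 2.0
weights[1][0][1][1] = 3.0
weights[1][1][2][1] = -1.0

rows = to_sparse_rows(weights, H, W)
out = direct_sparse_conv(x_flat, rows, H, W, R, S)
```

The inner loops run once per non-zero weight rather than once per weight, so the work scales with density, while the input is read in contiguous runs from its original layout.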

Implications and Future Directions

The implications arising from this research are multifold:

From a theoretical perspective, this work expands on the potential of direct sparse computation in deep learning frameworks, effectively bridging the gap between pruning-induced model size reduction and actual inference speedup.
Practically, the proposed methodologies align with current trends towards deploying CNNs on resource-constrained environments, such as mobile and edge computing, where computational efficiency is paramount.
Future Work: While the current implementation focused on direct sparse convolution efficiencies, the authors propose potential extensions incorporating Winograd and FFT-based algorithms to further refine 1×1 convolution efficiencies, currently not addressed by sparsity methods due to inherent low arithmetic intensity.

Overall, this paper makes substantive contributions towards more computationally efficient CNN implementations, establishing a practical approach for systematically leveraging model sparsity for faster inference while maintaining a theoretical underpinning through performance modelling. Through continued applications and optimizations, these advancements promise to significantly enhance the deployment capabilities of deep learning models across an expanded array of hardware platforms.