PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning (2001.00138v4)

Published 1 Jan 2020 in cs.LG, cs.CV, and cs.DC

Abstract: With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging considering high computation and storage demands, specifically, if real-time performance with high accuracy is needed. Weight pruning of DNNs is proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained, accurate, but not hardware friendly; structured pruning is coarse-grained, hardware-efficient, but with higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNN on mobile devices with the help of a novel model compression technique (pattern-based pruning based on extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network with speedup up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.

Authors (8)

Wei Niu (68 papers)
Xiaolong Ma (57 papers)
Sheng Lin (29 papers)
Shihao Wang (32 papers)
Xuehai Qian (40 papers)
Xue Lin (92 papers)
Yanzhi Wang (197 papers)
Bin Ren (136 papers)


Summary

The paper presents a novel pattern-based pruning strategy that balances model accuracy and hardware efficiency for mobile deep neural networks.
It integrates compiler optimizations such as filter reordering and load redundancy elimination to accelerate execution on both CPUs and GPUs.
Empirical results on VGG-16 and ResNet-50 show speedups of up to 44.5x over TensorFlow Lite, 11.4x over TVM, and 7.1x over Alibaba MNN, enabling real-time DNN inference on mobile devices.

PatDNN: Advancing Mobile Deep Neural Network Execution

This paper introduces PatDNN, a novel system designed to optimize the execution of deep neural networks (DNNs) on mobile devices by employing a pattern-based weight pruning approach combined with compiler and code generation optimizations. The primary goal of this framework is to achieve real-time inference capabilities on computationally constrained mobile hardware while maintaining high model accuracy.

The research addresses a central challenge in deploying DNNs on mobile devices: balancing execution speed against model accuracy. Existing pruning techniques sit at two extremes. Non-structured pruning is fine-grained and preserves accuracy but is not hardware-friendly; structured pruning is coarse-grained and hardware-efficient but incurs higher accuracy loss. PatDNN bridges this gap with pattern-based pruning, which applies fine-grained pruning patterns inside coarse-grained structures and relies on the compiler to regain hardware efficiency.
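To make the idea of fine-grained patterns inside a coarse-grained structure concrete, the following minimal sketch prunes one 3x3 kernel by picking a pattern from a small fixed library. Several things here are assumptions for illustration only: the 4-entry pattern size, the four pattern shapes in `PATTERNS`, and the helper names `retained_energy` and `prune_kernel`. The selection rule is a simple magnitude-based choice, which is swapped in for readability; it is not the extended ADMM solution framework that the paper uses to train pattern-pruned models.

```python
# Hypothetical pattern library (an assumption, not taken from the paper):
# each pattern lists the (row, col) positions that survive in a 3x3 kernel.
PATTERNS = [
    ((0, 0), (0, 1), (1, 0), (1, 1)),
    ((0, 1), (0, 2), (1, 1), (1, 2)),
    ((1, 0), (1, 1), (2, 0), (2, 1)),
    ((1, 1), (1, 2), (2, 1), (2, 2)),
]


def retained_energy(kernel, pattern):
    """Sum of squared weights that a pattern would keep."""
    return sum(kernel[r][c] ** 2 for r, c in pattern)


def prune_kernel(kernel):
    """Pick the pattern keeping the most weight energy and zero the rest.

    Returns (pattern_id, pruned_kernel). Magnitude-based selection is an
    illustrative stand-in for the ADMM-based training described in the paper.
    """
    best = max(range(len(PATTERNS)),
               key=lambda i: retained_energy(kernel, PATTERNS[i]))
    keep = set(PATTERNS[best])
    pruned = [[kernel[r][c] if (r, c) in keep else 0.0 for c in range(3)]
              for r in range(3)]
    return best, pruned


kernel = [
    [0.1, -0.9, 0.8],
    [0.0, 0.7, -0.6],
    [0.2, 0.1, 0.05],
]
pattern_id, pruned = prune_kernel(kernel)
print(pattern_id, pruned)
```

Because every kernel ends up with one of a small number of known shapes, a compiler can in principle generate specialized code per pattern, which is the property that separates this scheme from non-structured pruning.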

Key Contributions

Pattern-Based Pruning Strategy: The paper proposes a new pruning scheme that defines specific patterns within kernels, allowing for a balance between the flexibility of selecting active weights and the regularity required for efficient hardware utilization. This technique effectively combines the benefits of both structured and non-structured pruning.
Optimized Compiler and Execution Framework: Complementing the pattern-based pruning, PatDNN applies a set of architecture-aware compiler and code-generation optimizations: filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning. These optimizations regain the hardware efficiency that fine-grained pruning would otherwise sacrifice. The system targets both mobile CPUs and GPUs.
Empirical Evaluation and Results: Evaluation on popular DNN architectures like VGG-16 and ResNet-50, using datasets such as ImageNet and CIFAR-10, shows that PatDNN outperforms TensorFlow Lite, TVM, and Alibaba MNN with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, without compromising accuracy. The framework achieves real-time inference on high-end mobile devices, paving the way for more complex DNN applications on mobile platforms.
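The filter kernel reordering mentioned above can be illustrated with a small sketch. The summary does not spell out the reordering criteria, so the rules used here are assumptions: kernels inside a filter are sorted by pattern ID so that same-pattern kernels sit next to each other, and filters are then sorted by how many kernels survive and by their pattern signature so that neighbouring filters carry similar workloads. The data layout and the function name `reorder_filters` are likewise hypothetical.

```python
def reorder_filters(filters):
    """Group similar filters and kernels together (illustrative rule only).

    filters: list of filters; each filter is a list of
             (input_channel, pattern_id) tuples for its surviving kernels.
    Returns (order, reordered), where order[i] is the original index of the
    filter now at position i, so outputs can be mapped back afterwards.
    """
    # Within each filter, place kernels with the same pattern contiguously.
    sorted_filters = [sorted(f, key=lambda k: (k[1], k[0])) for f in filters]

    # Across filters, sort by kernel count, then by pattern signature.
    def signature(i):
        return (len(sorted_filters[i]), [p for _, p in sorted_filters[i]])

    order = sorted(range(len(filters)), key=signature)
    return order, [sorted_filters[i] for i in order]


filters = [
    [(0, 2), (1, 0), (3, 1)],  # filter 0: three surviving kernels
    [(2, 1)],                  # filter 1: one surviving kernel
    [(0, 0), (2, 2), (3, 1)],  # filter 2: three surviving kernels
    [(1, 1), (3, 0)],          # filter 3: two surviving kernels
]
order, reordered = reorder_filters(filters)
print(order)
print(reordered)
```

Returning the permutation alongside the reordered filters matters because the output channels must be restored to their original order (or the next layer's input channels permuted to match) for the network to compute the same result.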

Implications and Future Directions

The success of PatDNN has implications for both the practical deployment of AI on mobile devices and future theoretical research. Practically, the ability to execute complex DNNs in real time on mobile hardware without accuracy loss can support more sophisticated AI applications in areas such as augmented reality, mobile gaming, and real-time translation. Theoretically, this work motivates further exploration of hybrid pruning strategies, and of compiler optimizations designed alongside them, that extend beyond pattern-based pruning to improve DNN performance.

Future advancements may focus on refining the pruning patterns to adaptively optimize for specific model architectures or hardware configurations, potentially employing machine learning techniques to automate such adaptations. Additionally, exploring the integration of PatDNN's strategies into larger, cloud-based AI frameworks could further extend its applicability and efficiency, enabling seamless scaling from edge to cloud.

In conclusion, PatDNN represents a significant step forward in the efficient execution of neural networks on mobile devices, with its pattern-based pruning and compiler optimizations laying the groundwork for future research and development in the field. The success and insights of this work highlight the potential of algorithm-architecture co-design in optimizing AI workloads across diverse platforms.
