Quantized Convolutional Neural Networks for Mobile Devices (1512.06473v3)

Published 21 Dec 2015 in cs.CV

Abstract: Recently, convolutional neural networks (CNN) have demonstrated impressive performance in various computer vision tasks. However, high performance hardware is typically indispensable for the application of CNN models due to the high computation complexity, which prohibits their further extensions. In this paper, we propose an efficient framework, namely Quantized CNN, to simultaneously speed-up the computation and reduce the storage and memory overhead of CNN models. Both filter kernels in convolutional layers and weighting matrices in fully-connected layers are quantized, aiming at minimizing the estimation error of each layer's response. Extensive experiments on the ILSVRC-12 benchmark demonstrate 4~6x speed-up and 15~20x compression with merely one percentage loss of classification accuracy. With our quantized CNN model, even mobile devices can accurately classify images within one second.

Summary

The paper introduces a unified quantization framework that optimizes both convolutional and fully-connected layers for efficient mobile deployment.
It employs an error correction training scheme to reduce inference error while maintaining near-original accuracy even with significant compression.
Benchmarking on ILSVRC-12 and deployment on a Huawei Mate 7 demonstrate a 4–6× speedup and practical utility for real-time on-device processing.

The paper "Quantized Convolutional Neural Networks for Mobile Devices" by Jiaxiang Wu et al. addresses the problem of deploying computationally intensive Convolutional Neural Networks (CNNs) on resource-constrained platforms such as mobile devices. Because modern CNN architectures carry substantial computational and storage overheads, the authors propose a framework called Quantized CNN (Q-CNN) that simultaneously accelerates inference and compresses the model without a significant loss in accuracy.

Key Contributions and Methodology

The Q-CNN framework quantizes both the filter kernels in convolutional layers and the weight matrices in fully-connected layers. Rather than minimizing the quantization error of the network parameters alone, it minimizes the estimation error of each layer's response, so the quantization is fitted to the outputs the layer actually produces. This keeps the quantized network's inference accuracy close to that of the original network.
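The contrast between the two objectives can be written out explicitly. The following is a sketch in notation assumed here for illustration, not taken from the paper: W is a layer's weight matrix (inputs by outputs), \hat{W} is its quantized approximation, and x_1, ..., x_N are sample inputs to that layer.

```latex
% Parameter-space objective: every weight error counts equally.
\min_{\hat{W}} \; \lVert W - \hat{W} \rVert_F^2

% Response-space objective: errors are measured on the layer's outputs.
\min_{\hat{W}} \; \sum_{n=1}^{N} \lVert W^{\top} x_n - \hat{W}^{\top} x_n \rVert_2^2
  \;=\; \operatorname{tr}\!\Big( (W - \hat{W})^{\top}
        \Big( \sum_{n=1}^{N} x_n x_n^{\top} \Big) (W - \hat{W}) \Big)
```

The trace form shows why the two objectives can disagree: the response-space objective weights each parameter error by the second-moment matrix of the inputs, so errors along input directions that rarely occur cost little, while the parameter-space objective treats all directions alike.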

Main Contributions:

Unified Q-CNN Framework: The authors propose a unified approach to effectively quantize both convolutional and fully-connected layers in a CNN.
Error Correction Training: An effective training scheme is introduced to account for and correct the accumulative error across multiple quantized layers.
Extensive Benchmarking: The performance of Q-CNN is extensively validated through experiments on the ILSVRC-12 dataset with well-known CNN architectures such as AlexNet, CaffeNet, CNN-S, and VGG-16.
Hardware Implementation: The paper demonstrates that the Q-CNN framework can be implemented on mobile devices, achieving significant computational speed-up and storage reduction.
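To make concrete how quantized weights can yield a speed-up as well as compression, here is a minimal sketch of codebook-based inference for a fully-connected layer. It assumes a product-quantization-style layout, which this summary does not spell out: the input is split into M subspaces, each subspace has K shared codewords, and every output unit stores only one codeword index per subspace. All names, sizes, and random values below are hypothetical and chosen only for illustration.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

def split(vec, num_subspaces):
    """Split a vector into equally sized consecutive sub-vectors."""
    d = len(vec) // num_subspaces
    return [vec[m * d:(m + 1) * d] for m in range(num_subspaces)]

def reconstruct_rows(codebooks, assign, num_outputs):
    """Rebuild the dense (quantized) weight row of every output unit."""
    rows = []
    for t in range(num_outputs):
        row = []
        for m, book in enumerate(codebooks):
            row.extend(book[assign[m][t]])
        rows.append(row)
    return rows

def dense_forward(rows, x):
    """Reference path: one full inner product per output unit."""
    return [dot(row, x) for row in rows]

def lookup_forward(codebooks, assign, x, num_outputs):
    """Lookup path: K inner products per subspace, then table additions."""
    x_sub = split(x, len(codebooks))
    tables = [[dot(word, x_sub[m]) for word in book]
              for m, book in enumerate(codebooks)]
    return [sum(tables[m][assign[m][t]] for m in range(len(codebooks)))
            for t in range(num_outputs)]

# Hypothetical toy sizes: 8 inputs, 6 outputs, 4 subspaces, 3 codewords each.
rng = random.Random(0)
C_IN, C_OUT, M, K = 8, 6, 4, 3
D = C_IN // M
codebooks = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(K)]
             for _ in range(M)]
assign = [[rng.randrange(K) for _ in range(C_OUT)] for _ in range(M)]
x = [rng.uniform(-1, 1) for _ in range(C_IN)]

y_dense = dense_forward(reconstruct_rows(codebooks, assign, C_OUT), x)
y_lookup = lookup_forward(codebooks, assign, x, C_OUT)
max_gap = max(abs(a - b) for a, b in zip(y_dense, y_lookup))
print("largest difference between the two paths:", max_gap)
```

Both paths compute the same outputs up to floating-point rounding. The dense path needs C_IN * C_OUT multiplications, while the lookup path needs C_IN * K multiplications plus M * C_OUT additions, so the saving grows as K becomes small relative to the number of output units.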

Results and Comparative Analysis

The Q-CNN approach’s efficacy is illustrated through several key numerical results:

Speed-Up and Compression: On the ILSVRC-12 benchmark, Q-CNN achieves a 4 to 6 times speed-up and 15 to 20 times compression at the cost of roughly one percentage point of classification accuracy.
Error Correction Benefits: Applying error correction during quantization significantly reduces the accuracy loss. For instance, on VGG-16, Q-CNN achieves a 4.06 times speed-up with only a 0.58 percentage-point increase in top-5 classification error once error correction is applied.
Mobile Device Efficiency: The Q-CNN implementation on a Huawei® Mate 7 smartphone allows for image classification within one second while reducing storage requirements by a factor of 20.
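The arithmetic behind a compression factor of this order can be sketched for a single fully-connected layer. The calculation below assumes a codebook-plus-index storage scheme with 32-bit floating-point weights; the layer size, sub-vector length, and codebook size are hypothetical values picked for illustration and are not settings reported in the paper.

```python
def compression_ratio(c_in, c_out, sub_dim, num_codewords, float_bits=32):
    """Ratio of dense storage to codebook-plus-index storage for one layer."""
    num_subspaces = c_in // sub_dim
    # Dense layer: one float per weight.
    dense_bits = c_in * c_out * float_bits
    # Codebooks: num_codewords float sub-vectors for each subspace.
    codebook_bits = num_subspaces * num_codewords * sub_dim * float_bits
    # Indices: one codeword index per (subspace, output unit) pair.
    index_bits = num_subspaces * c_out * (num_codewords - 1).bit_length()
    return dense_bits / (codebook_bits + index_bits)

# Hypothetical 4096 x 4096 layer, sub-vectors of length 4, 32 codewords each.
ratio = compression_ratio(c_in=4096, c_out=4096, sub_dim=4, num_codewords=32)
print(f"compression ratio: {ratio:.2f}x")
```

With these assumed settings the ratio comes out at about 21 times, dominated by the 5-bit indices rather than the codebooks. Shorter sub-vectors or larger codebooks lower the ratio, which is the trade-off against approximation quality.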

Practical and Theoretical Implications

From a practical standpoint, Q-CNN facilitates the deployment of sophisticated CNN models on mobile and other resource-limited devices, potentially expanding the applications of deep learning in areas like real-time image recognition, augmented reality, and mobile health diagnostics. The storage and memory reductions also make it feasible to run these models offline, enhancing user privacy and reducing latency issues associated with cloud-based inference.

On the theoretical side, the paper shifts the quantization objective from the parameters to the network's outputs. By minimizing the estimation error of each layer's response instead of the parameter quantization error, and by correcting the error that accumulates across successive quantized layers, Q-CNN retains accuracy close to the original model even under substantial compression. This could inspire further research into combined quantization and error-correction schemes for other deep learning architectures.

Future Directions

Future research could explore quantization strategies tailored to other neural network types, such as Recurrent Neural Networks (RNNs) or Transformers, to extend Q-CNN's benefits beyond CNNs. Additionally, hardware-oriented optimizations, such as GPU or FPGA acceleration designed specifically for quantized operations, could provide further efficiency gains.

In conclusion, the paper presents an effective solution to the problem of deploying deep learning models on mobile devices. The proposed Q-CNN framework demonstrates that significant computational and storage savings can be achieved with minimal accuracy loss, making such models more practical to run on resource-limited hardware.