Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. Our source code is open-source and available at: https://github.com/apple/ml-cvnets
MobileViT combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to develop a light-weight, low-latency network for mobile vision tasks.
The MobileViT block integrates local feature extraction using convolutions with global dependencies handled by transformers, achieving a balance of robust performance and efficiency.
Empirical results show that MobileViT outperforms earlier light-weight models such as MobileNetv3 and DeIT on ImageNet-1k and MS-COCO at a similar parameter count, while using a simpler training recipe and relying less on extensive data augmentation.
The paper "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", authored by Sachin Mehta and Mohammad Rastegari, introduces MobileViT, an architecture that integrates the benefits of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for mobile vision tasks. The paper is particularly relevant for researchers interested in efficient model design for resource-constrained environments.
The practical implications of MobileViT are substantial. By bringing together the best attributes of CNNs and ViTs, MobileViT makes it feasible to deploy high-performing visual recognition models on mobile devices, where computational and memory resources are limited. The authors’ contribution lies in demonstrating that transformers can be scaled down effectively to work within the constrained parameter budgets typical of mobile devices, without significant performance trade-offs.
Theoretically, the blending of local and global processing within the MobileViT block offers a new perspective on neural network design. CNNs are known for their ability to capture local patterns via convolutions, but they struggle with global dependencies; ViTs capture global information well but require large datasets and extensive training. MobileViT leverages the strengths of both architectures, potentially inspiring future research to explore other ways of blending different network paradigms.
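To make the local/global split concrete, here is a minimal sketch in plain Python of the unfold, attend, fold pattern that the MobileViT paper describes for its block. It is an illustration under simplifying assumptions, not the authors' implementation: all function names are hypothetical, a 3x3 mean filter stands in for the learned convolution, and an unparameterized scaled dot-product attention stands in for the transformer layers. The projection, fusion, and skip-connection steps of the real block are omitted.

```python
import math


def local_mix(fmap):
    """3x3 mean filter with zero padding (stand-in for a learned convolution)."""
    H, W, d = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * d for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < H and 0 <= x < W:
                        for c in range(d):
                            out[i][j][c] += fmap[y][x][c] / 9.0
    return out


def unfold(fmap, ph, pw):
    """Split an H x W x d map into non-overlapping ph x pw patches.

    Returns tokens[p][n]: the d-dim vector at pixel position p of patch n,
    with P = ph * pw positions and N = (H // ph) * (W // pw) patches.
    """
    H, W = len(fmap), len(fmap[0])
    assert H % ph == 0 and W % pw == 0, "sketch assumes divisible sizes"
    tokens = [[] for _ in range(ph * pw)]
    for top in range(0, H, ph):
        for left in range(0, W, pw):
            for p in range(ph * pw):
                tokens[p].append(list(fmap[top + p // pw][left + p % pw]))
    return tokens


def fold(tokens, H, W, ph, pw):
    """Inverse of unfold: place every token back at its spatial location."""
    fmap = [[None] * W for _ in range(H)]
    n = 0
    for top in range(0, H, ph):
        for left in range(0, W, pw):
            for p in range(ph * pw):
                fmap[top + p // pw][left + p % pw] = list(tokens[p][n])
            n += 1
    return fmap


def self_attention(seq):
    """Scaled dot-product self-attention with identity projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, seq))
                    for c in range(d)])
    return out


def global_mix(fmap, ph, pw):
    """Attend across patches separately for each within-patch pixel position."""
    H, W = len(fmap), len(fmap[0])
    tokens = unfold(fmap, ph, pw)
    mixed = [self_attention(seq) for seq in tokens]
    return fold(mixed, H, W, ph, pw)


# Toy 4 x 4 feature map with 2 channels, processed with 2 x 2 patches.
feature_map = [[[float(i * 4 + j), 1.0] for j in range(4)] for i in range(4)]
output = global_mix(local_mix(feature_map), 2, 2)
```

Because the local step has already mixed each pixel with its neighbours, attending only across patches at a fixed pixel position still lets every output location depend on the whole input, and folding restores the spatial layout so that further convolutions can follow.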
Future work could extend MobileViT in several directions:
Contrary to the general perception that ViTs are bulky and inefficient, MobileViT shows that, with careful architectural design, transformers can be competitive in low-resource settings. However, some questions remain open, such as how well the MobileViT architecture generalizes across different domains without extensive augmentation, and whether further gains can be achieved by integrating different transformation techniques.
In conclusion, the MobileViT paper makes the case that hybrid architectures are both needed and feasible in mobile vision applications. By bridging the gap between CNNs and ViTs, MobileViT offers a versatile and efficient architecture for resource-constrained environments.