
Separable Self-attention for Mobile Vision Transformers (2206.02680v1)

Published 6 Jun 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: \url{https://github.com/apple/ml-cvnets}

Citations (185)

Summary

  • The paper introduces a separable self-attention mechanism that reduces computational complexity from O(k²) to O(k).
  • It integrates the new approach into MobileViTv2, achieving 75.6% ImageNet accuracy while running 3.2 times faster on mobile devices.
  • Experimental results demonstrate improved performance in classification, segmentation, and detection for resource-constrained applications.

Separable Self-attention for Mobile Vision Transformers

The paper introduces a novel approach to addressing the computational inefficiencies of Mobile Vision Transformers (MobileViT), specifically focusing on reducing latency through a separable self-attention mechanism. MobileViT models have demonstrated state-of-the-art performance across several mobile vision tasks, including classification and detection. However, the efficiency bottleneck in these models lies within the multi-headed self-attention (MHA) component of transformers, which traditionally requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$.
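To make the bottleneck concrete, here is a minimal sketch of standard scaled dot-product self-attention in PyTorch (tensor names and sizes are illustrative, not from the paper): the score matrix has shape $k \times k$, so both compute and memory grow quadratically with the number of tokens.

```python
import torch

k, d = 256, 128                      # k tokens (patches), d-dim embeddings
x = torch.randn(1, k, d)             # a single input sequence

# Standard single-head self-attention: Q, K, V are linear projections of x.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# The score matrix is k x k -- this batch-wise matmul is the O(k^2)
# bottleneck the paper targets (here 256 x 256 scores per head).
scores = torch.softmax((Q @ K.transpose(-2, -1)) / d ** 0.5, dim=-1)
out = scores @ V                     # (1, k, k) @ (1, k, d) -> (1, k, d)
print(scores.shape)                  # torch.Size([1, 256, 256])
```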

Key Contributions

  1. Separable Self-attention: The primary contribution of this work is a separable self-attention mechanism with linear time complexity, $O(k)$. This approach replaces the costly batch-wise matrix multiplication with more efficient element-wise operations, making it well-suited for resource-constrained devices. The separable self-attention uses a latent token to compute context scores, which re-weight the input tokens to encode global information efficiently (see the sketch after this list).
  2. MobileViTv2: By integrating the separable self-attention into the MobileViT architecture, the authors present MobileViTv2. This new model achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming its predecessor by approximately 1% and running 3.2 times faster on mobile devices. The MobileViTv2 architecture scales efficiently across different complexities by using a width multiplier.
  3. Comparative Analysis: The separable self-attention was compared against standard MHA and Linformer-based self-attention. The results highlight the efficiency of the proposed method at both the module and architecture levels, showing significant improvements in speed without compromising accuracy.
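
Below is a minimal PyTorch sketch of the separable self-attention described in item 1, assuming a single latent token and the branch structure from the paper (a scores branch, a key branch, and a ReLU value branch); layer names are illustrative, and the official apple/ml-cvnets implementation may organize this differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    """Linear-complexity self-attention sketch (names are illustrative).

    Follows the paper's description: a latent token produces per-token
    context scores, which form a single d-dim context vector that
    re-weights the value branch -- all element-wise, with no k x k matmul.
    """
    def __init__(self, d: int):
        super().__init__()
        self.to_scores = nn.Linear(d, 1)   # latent-token branch: d -> 1 per token
        self.to_key = nn.Linear(d, d)      # key branch
        self.to_value = nn.Linear(d, d)    # value branch
        self.out = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, d)
        scores = F.softmax(self.to_scores(x), dim=1)            # (batch, k, 1)
        # Weighted sum of keys collapses the token axis: (batch, 1, d).
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)
        values = F.relu(self.to_value(x))                       # (batch, k, d)
        return self.out(values * context)                       # broadcast element-wise

x = torch.randn(2, 256, 128)                 # 2 sequences, k=256 tokens, d=128
print(SeparableSelfAttention(128)(x).shape)  # torch.Size([2, 256, 128])
```

Because the only cross-token interaction is the weighted sum producing `context`, compute and memory scale linearly in $k$, which is what underlies the reported $3.2\times$ speedup on mobile hardware.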

Experimental Validation

The paper provides a thorough experimental validation across various tasks:

  • Object Classification: MobileViTv2 models outperformed existing transformer-based models and achieved performance levels comparable to CNN-based architectures while bridging the latency gap, particularly on resource-constrained devices.
  • Semantic Segmentation and Object Detection: The integration of MobileViTv2 into standard architectures such as PSPNet and DeepLabv3 demonstrated efficient performance on ADE20k and PASCAL VOC datasets. For object detection on MS-COCO, MobileViTv2 showed competitive results with significantly fewer parameters and FLOPs.

Implications and Future Work

The development of separable self-attention suggests promising implications for deploying transformer models on resource-constrained devices, such as mobile phones. It lowers the computational burden, thus extending the practical application of vision transformers in real-time scenarios. The approach could potentially be adapted and extended to other transformer-based architectures, such as those used in natural language processing, to enhance performance on devices with limited resources.

Future developments could explore further optimization of the separable self-attention mechanism, including investigating the utilization of multiple latent tokens or alternative projection strategies to push the boundaries of efficiency without trade-offs in performance. Additionally, there is potential for implementing hardware-specific optimizations that could further enhance the applicability of MobileViTv2 in diverse and constrained environments.

In summary, this work presents a significant step towards making vision transformers more viable for real-time applications on mobile and other resource-constrained platforms, opening up opportunities for expanded use and innovation in mobile computing.
