
Separable Self-attention for Mobile Vision Transformers (2206.02680v1)

Published 6 Jun 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: \url{https://github.com/apple/ml-cvnets}

Citations (185)

Summary

  • The paper introduces a separable self-attention mechanism that reduces computational complexity from O(k²) to O(k).
  • It integrates the new approach into MobileViTv2, achieving 75.6% ImageNet accuracy while running 3.2 times faster on mobile devices.
  • Experimental results demonstrate improved performance in classification, segmentation, and detection for resource-constrained applications.

Separable Self-attention for Mobile Vision Transformers

The paper introduces a novel approach to addressing the computational inefficiencies of Mobile Vision Transformers (MobileViT), specifically focusing on reducing latency through a separable self-attention mechanism. MobileViT models have demonstrated state-of-the-art performance across several mobile vision tasks, including classification and detection. However, the efficiency bottleneck in these models lies within the multi-headed self-attention (MHA) component of transformers, which traditionally requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$.
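To make the bottleneck concrete, here is a minimal sketch of standard scaled dot-product self-attention in PyTorch (tensor names and sizes are illustrative, not from the paper): the score matrix has shape $k \times k$, so both compute and memory grow quadratically with the number of tokens.

```python
import torch

k, d = 256, 128                      # k tokens (patches), d-dim embeddings
x = torch.randn(1, k, d)             # a single input sequence

# Standard single-head self-attention: Q, K, V are linear projections of x.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# The score matrix is k x k -- this batch-wise matmul is the O(k^2)
# bottleneck the paper targets (here 256 x 256 scores per head).
scores = torch.softmax((Q @ K.transpose(-2, -1)) / d ** 0.5, dim=-1)
out = scores @ V                     # (1, k, k) @ (1, k, d) -> (1, k, d)
print(scores.shape)                  # torch.Size([1, 256, 256])
```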

Key Contributions

  1. Separable Self-attention: The primary contribution of this work is a separable self-attention mechanism with linear time complexity, $O(k)$. This approach replaces the costly batch-wise matrix multiplication with more efficient element-wise operations, making it well-suited for resource-constrained devices. The separable self-attention uses a latent token to compute context scores, which re-weight the input tokens to encode global information efficiently (see the sketch after this list).
  2. MobileViTv2: By integrating the separable self-attention into the MobileViT architecture, the authors present MobileViTv2. This new model achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming its predecessor by approximately 1% and running 3.2 times faster on mobile devices. The MobileViTv2 architecture scales efficiently across different complexities by using a width multiplier.
  3. Comparative Analysis: The separable self-attention was compared against standard MHA and Linformer-based self-attention. The results highlight the efficiency of the proposed method at both the module and architecture levels, showing significant improvements in speed without compromising accuracy.
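
Below is a minimal PyTorch sketch of the separable self-attention described in item 1, assuming a single latent token and the branch structure from the paper (a scores branch, a key branch, and a ReLU value branch); layer names are illustrative, and the official apple/ml-cvnets implementation may organize this differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    """Linear-complexity self-attention sketch (names are illustrative).

    Follows the paper's description: a latent token produces per-token
    context scores, which form a single d-dim context vector that
    re-weights the value branch -- all element-wise, with no k x k matmul.
    """
    def __init__(self, d: int):
        super().__init__()
        self.to_scores = nn.Linear(d, 1)   # latent-token branch: d -> 1 per token
        self.to_key = nn.Linear(d, d)      # key branch
        self.to_value = nn.Linear(d, d)    # value branch
        self.out = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, d)
        scores = F.softmax(self.to_scores(x), dim=1)            # (batch, k, 1)
        # Weighted sum of keys collapses the token axis: (batch, 1, d).
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)
        values = F.relu(self.to_value(x))                       # (batch, k, d)
        return self.out(values * context)                       # broadcast element-wise

x = torch.randn(2, 256, 128)                 # 2 sequences, k=256 tokens, d=128
print(SeparableSelfAttention(128)(x).shape)  # torch.Size([2, 256, 128])
```

Because the only cross-token interaction is the weighted sum producing `context`, compute and memory scale linearly in $k$, which is what underlies the reported $3.2\times$ speedup on mobile hardware.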

Experimental Validation

The paper provides a thorough experimental validation across various tasks:

  • Object Classification: MobileViTv2 models outperformed existing transformer-based models and achieved performance levels comparable to CNN-based architectures while bridging the latency gap, particularly on resource-constrained devices.
  • Semantic Segmentation and Object Detection: The integration of MobileViTv2 into standard architectures such as PSPNet and DeepLabv3 demonstrated efficient performance on ADE20k and PASCAL VOC datasets. For object detection on MS-COCO, MobileViTv2 showed competitive results with significantly fewer parameters and FLOPs.

Implications and Future Work

The development of separable self-attention suggests promising implications for deploying transformer models on resource-constrained devices, such as mobile phones. It lowers the computational burden, thus extending the practical application of vision transformers in real-time scenarios. The approach could potentially be adapted and extended to other transformer-based architectures, such as those used in natural language processing, to enhance performance on devices with limited resources.

Future developments could explore further optimization of the separable self-attention mechanism, including investigating the utilization of multiple latent tokens or alternative projection strategies to push the boundaries of efficiency without trade-offs in performance. Additionally, there is potential for implementing hardware-specific optimizations that could further enhance the applicability of MobileViTv2 in diverse and constrained environments.

In summary, this work presents a significant step towards making vision transformers more viable for real-time applications on mobile and other resource-constrained platforms, opening up opportunities for expanded use and innovation in mobile computing.
