Rethinking Local Perception in Lightweight Vision Transformer

Published 31 Mar 2023 in cs.CV | (2303.17803v5)

Abstract: Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, resizing them to a mobile-friendly size leads to significant performance degradation. Therefore, developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention, then proposes an effective and straightforward module to capture high-frequency local information. In CloFormer, we introduce AttnConv, a convolution operator in attention's style. The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features. The combination of the AttnConv and vanilla attention which uses pooling to reduce FLOPs in CloFormer enables the model to perceive high-frequency and low-frequency information. Extensive experiments were conducted in image classification, object detection, and semantic segmentation, demonstrating the superiority of CloFormer. The code is available at https://github.com/qhfan/CloFormer.


Authors (4)

Citations (18)


Summary

The paper introduces CloFormer, a lightweight vision transformer that uses AttnConv, a convolution operator in the style of attention, to capture high-frequency local details.
It employs a two-branch structure that integrates local and global features, improving results on tasks such as image classification and segmentation.
Empirical results show competitive accuracy on ImageNet and COCO benchmarks while maintaining low computational overhead for mobile applications.

Insights into "Rethinking Local Perception in Lightweight Vision Transformer"

The paper "Rethinking Local Perception in Lightweight Vision Transformer" introduces CloFormer, a lightweight vision transformer designed to balance computational cost against accuracy for mobile and other low-resource applications. The authors address the significant performance degradation observed when Vision Transformers (ViTs) are scaled down to mobile-friendly sizes, which has been a barrier to their widespread use on such devices.

Technical Contributions

The primary contribution of this paper is CloFormer, which integrates a novel mechanism called AttnConv, a convolution operator in the style of attention. This operator is central to capturing high-frequency local information, which matters for image classification, object detection, and semantic segmentation.

AttnConv Mechanism:
- Shared and Context-Aware Weights: The paper examines the relationship between the globally shared weights used in convolutional layers and the token-specific, context-aware weights found in attention mechanisms. AttnConv combines the two: shared weights aggregate local information, and context-aware weights enhance the resulting local features.
- Enhanced Nonlinearity: By employing a gating mechanism to generate context-aware weights, AttnConv introduces a stronger nonlinearity than typical attention mechanisms, potentially leading to enhanced feature representation.
Two-Branch Structure:
- The CloFormer architecture comprises a two-branch framework with a clear division of tasks: the local branch processes high-frequency information using AttnConv, while the global branch employs vanilla attention with pooling (which reduces FLOPs) to handle low-frequency global information.
- This structure allows CloFormer to use both local and global information efficiently, which is crucial for tasks that require multi-scale feature integration.
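
The two branches described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the official repository for that): it works on a 1-D token sequence rather than a 2-D feature map, and it assumes a depthwise convolution for the shared weights, a tanh gate for the context-aware weights, and a channel split with concatenation to merge the branches. All function and parameter names (`attn_conv`, `pooled_attention`, `two_branch_block`, `pool`) are hypothetical.

```python
import numpy as np

def depthwise_conv1d(x, w):
    """Depthwise 1-D convolution with zero padding.
    x: (N, C) tokens; w: (K, C), one K-tap filter per channel,
    shared across all token positions."""
    n, _ = x.shape
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for j in range(k):
        out += xp[j:j + n] * w[j]
    return out

def attn_conv(x, wq, wk, wv, dw_q, dw_k, dw_v):
    """Local branch (assumed form): shared weights aggregate, a gate modulates."""
    q, k, v = x @ wq, x @ wk, x @ wv
    v_local = depthwise_conv1d(v, dw_v)   # shared weights aggregate local info
    q_local = depthwise_conv1d(q, dw_q)
    k_local = depthwise_conv1d(k, dw_k)
    d = x.shape[1]
    # Token-specific, context-aware weights in [-1, 1] (assumed tanh gate).
    gate = np.tanh(q_local * k_local / np.sqrt(d))
    return gate * v_local

def pooled_attention(x, wq, wk, wv, pool):
    """Global branch: softmax attention with average-pooled keys and values."""
    n, c = x.shape
    q = x @ wq
    # Pool tokens N -> N // pool, shrinking the score matrix from N*N to N*(N // pool).
    x_pooled = x[: n - n % pool].reshape(-1, pool, c).mean(axis=1)
    k, v = x_pooled @ wk, x_pooled @ wv
    scores = q @ k.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

def two_branch_block(x, params, pool=4):
    """Split channels between the branches and concatenate the outputs (assumption)."""
    half = x.shape[1] // 2
    local = attn_conv(x[:, :half], *params["local"])
    glob = pooled_attention(x[:, half:], *params["global"], pool)
    return np.concatenate([local, glob], axis=1)

rng = np.random.default_rng(0)
n, c, ksize = 16, 8, 3
half = c // 2
params = {
    "local": [rng.normal(size=(half, half)) for _ in range(3)]
             + [rng.normal(size=(ksize, half)) for _ in range(3)],
    "global": [rng.normal(size=(half, half)) for _ in range(3)],
}
x = rng.normal(size=(n, c))
y = two_branch_block(x, params)
print(y.shape)  # (16, 8)
```

In this sketch the gate depends on each token's own neighbourhood, so the local branch applies a different effective weight at every position even though the convolution filters themselves are shared. The global branch sees only `N // pool` pooled tokens as keys and values, which is where the FLOP reduction comes from.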

Empirical Evaluation

The effectiveness of CloFormer is demonstrated across multiple vision tasks:

Image Classification: On the ImageNet-1K dataset, CloFormer variants achieve competitive top-1 accuracy with low parameter counts and computational cost. For instance, CloFormer-XXS attains 77.0% top-1 accuracy with only 4.2 million parameters and 0.6 GFLOPs.
Object Detection and Segmentation: On the COCO benchmark, CloFormer is evaluated as a backbone for standard frameworks such as RetinaNet and Mask R-CNN, where it improves box AP and mask AP over existing lightweight models.
Semantic Segmentation: On the ADE20K dataset, CloFormer again outperforms comparable lightweight models in mIoU, indicating that it handles dense prediction tasks well.
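
For readers unfamiliar with the segmentation metric, mIoU (mean intersection-over-union) can be sketched as below. This is a simplified illustration, not the evaluation code used in the paper: it scores one flat list of labels, whereas benchmark toolkits typically accumulate per-class intersections and unions over the whole dataset and skip an "ignore" label. The function name `mean_iou` and the toy labels are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.
    pred, target: flat sequences of integer class labels of equal length."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU here is 1/3, 2/3, and 1/2, so the mean is about 0.5.
print(mean_iou(pred, target, num_classes=3))
```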

Implications and Future Directions

The paper's results suggest that model size and computational cost can be reduced substantially with little loss of accuracy, which matters for deploying transformer architectures in edge and mobile environments. The proposed methods may inspire further research on local perception mechanisms in vision models, both to refine existing applications and to enable new ones in resource-constrained settings.

In conclusion, "Rethinking Local Perception in Lightweight Vision Transformer" improves the trade-off between computational efficiency and model performance, providing a foundation for subsequent work on lightweight vision transformers. Further work might extend these methods to other domains or integrate additional efficiency techniques, such as adaptive weight sharing and dynamic computation.
