LightViT: Towards Light-Weight Convolution-Free Vision Transformers (2207.05557v1)

Published 12 Jul 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that the convolutional kernels perform information aggregation to connect all tokens; however, they would be actually unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT as a new family of light-weight ViTs to achieve better accuracy-efficiency balance upon the pure transformer blocks without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both self-attention and feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies; and bi-dimensional channel and spatial attentions are imposed over token embeddings. Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks. For example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while 11% faster on GPU. Code is available at https://github.com/hunto/LightViT.

Citations (50)

Summary

  • The paper introduces a pure transformer model, LightViT, that eliminates convolutions for efficient vision processing.
  • It employs global aggregation tokens and a bi-dimensional attention module to enhance spatial and channel feature learning.
  • Experimental evaluation shows 78.7% ImageNet accuracy at 0.7G FLOPs, outperforming comparable convolution-based models.

LightViT: Advancement in Convolution-Free Vision Transformers

The paper "LightViT: Towards Light-Weight Convolution-Free Vision Transformers" introduces a novel approach to enhancing the efficiency of Vision Transformers (ViTs) by eliminating convolutional components entirely. The authors propose the LightViT model, which aims to achieve an improved accuracy-efficiency balance while simplifying the architecture to rely solely on pure transformer blocks. The central innovation lies in introducing novel aggregation schemes that enable ViTs to perform effectively without incorporating convolutional operations.

Key Contributions

The fundamental contributions of this research can be distilled into several key components:

  1. Global Aggregation Tokens: The model introduces learnable global tokens within the self-attention framework. These tokens aggregate information from the local (patch) tokens across the image, then redistribute the captured global dependencies back to the local features. This provides a simple yet efficient way of sharing information globally without convolutional kernels (a minimal sketch of this gather-and-broadcast pattern is given after this list).
  2. Bi-dimensional Attention Module in FFN: The feed-forward network incorporates a bi-dimensional attention mechanism that explicitly models both spatial and channel dependencies, boosting representational capacity, which is especially important for lightweight models constrained by small channel dimensions (see the second sketch below).
  3. Architectural Efficiency: LightViT removes early-stage convolutions and adopts a hierarchical structure with fewer stages to improve computational throughput. The design favors pragmatic efficiency, using modifications such as residual patch merging to maintain performance without incurring significant computational cost.
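
The gather-and-broadcast idea behind the global aggregation tokens can be illustrated with a short PyTorch sketch. This is a hedged illustration of the mechanism described above, not the authors' implementation: the module name `GlobalTokenAttention`, the number of global tokens, and the use of two standard multi-head attention layers are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class GlobalTokenAttention(nn.Module):
    """Illustrative sketch: learnable global tokens gather information from all
    local tokens, then the local tokens retrieve the aggregated global context."""

    def __init__(self, dim: int, num_heads: int = 4, num_global_tokens: int = 8):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        nn.init.trunc_normal_(self.global_tokens, std=0.02)
        # Plain multi-head attention for both directions of information flow.
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_local_tokens, dim)
        g = self.global_tokens.expand(x.size(0), -1, -1)
        # Step 1: global tokens aggregate information from all local tokens.
        g, _ = self.gather(query=g, key=x, value=x)
        # Step 2: local tokens receive the broadcast global context.
        ctx, _ = self.broadcast(query=x, key=g, value=g)
        return x + ctx  # residual connection preserves local detail
```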

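Similarly, the bi-dimensional (channel plus spatial) attention over token embeddings in the FFN can be sketched as below. The specific gating design here (a squeeze-and-excitation style channel gate combined with a per-token spatial gate) is an illustrative assumption rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


class BiDimAttention(nn.Module):
    """Illustrative sketch: gate token embeddings along both the channel and
    the spatial (token) dimensions inside the FFN."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        # Channel gate: pool over tokens, produce one weight per channel.
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.GELU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )
        # Spatial gate: produce one weight per token from its embedding.
        self.spatial_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        c = self.channel_gate(x.mean(dim=1, keepdim=True))  # (B, 1, dim)
        s = self.spatial_gate(x)                             # (B, N, 1)
        return x * c * s
```
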
Experimental Evaluation

The paper evaluates LightViT across several prominent computer vision benchmarks, including image classification on ImageNet, object detection on MS-COCO, and semantic segmentation. Notably, the LightViT-T configuration reaches 78.7% top-1 accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while running about 11% faster on GPU. The model is also reported to run roughly 14% faster at inference than comparable models such as ResT-Small, with marginally fewer FLOPs.

Implications and Future Prospects

The implications of this research are notable along both theoretical and practical dimensions. Eliminating convolutions raises the question of how pure transformers may evolve into the standard backbone for vision tasks traditionally dominated by CNNs. In addition, deploying LightViT in environments where computational resources are at a premium could offer meaningful cost savings and performance gains.

Future work could further optimize the token aggregation schemes and investigate settings where forgoing convolutional inductive biases is advantageous. Understanding edge cases or specific tasks where convolutional components still offer irreplaceable benefits may also point toward hybrid models that integrate the best of both paradigms.

Conclusion

Overall, this work contributes to the ongoing discussion of the trade-off between architectural simplicity and performance in deep learning models. By removing convolutions entirely and maintaining strong performance under tight computational budgets, LightViT marks a step toward efficient pure-transformer architectures and suggests how vision tasks might be tackled by models free of convolutional priors.
