Rethinking Spatial Dimensions of Vision Transformers

Published 30 Mar 2021 in cs.CV | (2103.16302v2)

Abstract: Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the transformer-based architecture has been innovative for computer vision modeling, the design convention towards an effective architecture has been less studied yet. From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon the original ViT model. We show that PiT achieves the improved model capability and generalization performance against ViT. Throughout the extensive experiments, we further show PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source codes and ImageNet models are available at https://github.com/naver-ai/pit

Summary

The paper presents PiT, which adds pooling layers to the Vision Transformer so that spatial dimensions shrink with depth, as in CNNs, with the aim of improving computational efficiency and performance.
The research reports accuracy gains over the ViT baseline on benchmarks such as ImageNet, along with better robustness to input perturbations.
The findings indicate that spatial reduction in transformers leads to more diversified attention patterns, which points toward lightweight, robust vision models.

The paper "Rethinking Spatial Dimensions of Vision Transformers" examines how spatial dimensions should be designed in Vision Transformers (ViTs) and proposes an architectural modification named Pooling-based Vision Transformer (PiT). It studies how the spatial dimensions of the token representation affect the performance of transformer-based models for computer vision tasks.

Context and Motivation

Vision Transformers have emerged as strong competitors to Convolutional Neural Networks (CNNs) by using the self-attention mechanism, which lets every image patch interact with every other patch. However, whereas a CNN reduces its spatial dimensions and increases its channel dimension as depth increases, a typical ViT keeps the same spatial dimensions in every layer. The study posits that adopting the spatial dimension reduction used in CNNs can make ViTs more effective.

Architectural Contribution

The primary contribution of the paper is PiT, which integrates pooling layers into the standard transformer architecture to emulate the dimension reduction found in CNNs. As depth increases, these pooling layers reduce the spatial size and increase the channel dimension, with the aim of improving both computational efficiency and generalization performance.
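The dimension schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summary does not give the pooling stride, the channel growth factor, or the starting sizes, so the stride of 2, the multiplier of 2, and the 16x16 grid with 64 channels are all hypothetical, as are the function names.

```python
def pooled_stage_shapes(height, width, channels, num_stages,
                        stride=2, channel_multiplier=2):
    """Return (height, width, channels) for each stage of a pooled transformer.

    stride and channel_multiplier are illustrative assumptions, not values
    taken from the paper.
    """
    shapes = [(height, width, channels)]
    for _ in range(num_stages - 1):
        # Ceiling division, as a padded strided pooling op would produce.
        height = -(-height // stride)
        width = -(-width // stride)
        # Widen the channels while the token grid shrinks.
        channels = channels * channel_multiplier
        shapes.append((height, width, channels))
    return shapes


def token_count(shape):
    """Number of spatial tokens in a (height, width, channels) stage."""
    height, width, _ = shape
    return height * width


# Hypothetical starting point: a 16x16 token grid with 64 channels.
stages = pooled_stage_shapes(height=16, width=16, channels=64, num_stages=3)
for index, shape in enumerate(stages):
    print(f"stage {index}: grid {shape[0]}x{shape[1]}, "
          f"{token_count(shape)} tokens, {shape[2]} channels")
```

A ViT-style model would correspond to a single stage in this sketch, with the token count and channel width fixed from the first layer to the last.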

Empirical Findings

Through comprehensive experiments, the authors demonstrate that PiT outperforms the baseline ViT architecture across various tasks, including image classification and object detection. Notably, the experiments reveal:

Improved Performance: PiT demonstrates superior model capability and generalization compared to ViT, particularly on standard benchmarks like ImageNet.
Enhanced Robustness: The modified architecture shows better robustness to input perturbations such as occlusion and adversarial attacks.
Attention Analysis: By examining attention matrices, the research indicates that spatial reduction leads to more diversified attention patterns, which could be preferable for visual processing.
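The summary does not say how the attention matrices were compared. One common way to quantify how spread out an attention pattern is, assumed here purely for illustration, is the Shannon entropy of each attention row: a row that attends evenly to many tokens has high entropy, and a row concentrated on one token has entropy near zero. The function name and the example rows below are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights.

    The row is renormalized first, so it does not need to sum exactly to 1.
    """
    total = sum(weights)
    probabilities = [w / total for w in weights]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Two made-up attention rows over four tokens.
uniform_row = [0.25, 0.25, 0.25, 0.25]
peaked_row = [0.97, 0.01, 0.01, 0.01]

print("uniform row entropy:", attention_entropy(uniform_row))
print("peaked row entropy:", attention_entropy(peaked_row))
```

Averaging this value over rows, heads, and layers gives a single number per model that can be compared across architectures, although it captures only one aspect of what "diversified" might mean.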

Numerical Results and Comparisons

The paper reports that PiT reaches higher accuracy than ViT at a lower computational cost. On ImageNet classification, for instance, PiT improves accuracy under the same training regime without increasing model size or latency.
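To see why shrinking the token grid can lower cost even as channels grow, the rough multiply-accumulate count of one self-attention layer can be worked out directly. The sketch below is a back-of-the-envelope estimate, not a reproduction of the paper's measurements: it ignores the class token, the softmax, and the feed-forward block, and the token counts and channel widths are hypothetical (a 4x reduction in tokens paired with a 2x increase in channels).

```python
def attention_layer_macs(num_tokens, channels):
    """Approximate multiply-accumulates for one self-attention layer.

    Returns (projection_macs, mixing_macs).
    """
    # Q, K, V and output projections: four channels-by-channels
    # matrix multiplies applied to every token.
    projection_macs = 4 * num_tokens * channels * channels
    # Token mixing: the QK^T score matrix plus the weighted sum over V.
    mixing_macs = 2 * num_tokens * num_tokens * channels
    return projection_macs, mixing_macs


# Hypothetical sizes before and after one pooling step
# (width and height halved, channels doubled).
before = attention_layer_macs(num_tokens=256, channels=64)
after = attention_layer_macs(num_tokens=64, channels=128)

print("projection MACs before/after:", before[0], after[0])
print("token-mixing MACs before/after:", before[1], after[1])
```

Under these assumptions the projection cost is unchanged, because a quarter of the tokens times four times the squared width cancels out, while the token-mixing cost, which grows with the square of the token count, drops by a factor of eight.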

Theoretical and Practical Implications

Adding spatial dimension reduction to transformers through pooling layers is a promising direction for improving vision transformers. PiT offers a concrete architecture that combines a CNN design principle with self-attention, and it may influence future architectural designs in computer vision research.

Future Directions

The performance of PiT opens avenues for developing lightweight transformer architectures that could be as efficient as traditional CNNs like MobileNet at lower model scales. Further exploration of pooling strategies could also yield variants better suited to particular vision tasks.

Conclusion

"Rethinking Spatial Dimensions of Vision Transformers" explores the spatial configuration of transformer architectures for vision applications. By carrying the spatial reduction principle of CNNs over to transformers, the work shows that spatial operations matter for model performance and gives later architectures a concrete design to build on.
