Rethinking and Improving Relative Position Encoding for Vision Transformer

Published 29 Jul 2021 in cs.CV | (2107.14222v1)

Abstract: Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work equally well as absolute position? In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in self-attention mechanism. The proposed iRPE methods are simple and lightweight. They can be easily plugged into transformer blocks. Experiments demonstrate that solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.

Summary

The paper introduces novel image-specific RPE methods that improve self-attention mechanisms for vision transformers.
It systematically adapts RPE techniques designed for 1D NLP sequences to 2D image data, adding directional modeling of relative distance.
Empirical evaluations show gains of up to 1.5% top-1 accuracy for DeiT on ImageNet and 1.3% mAP for DETR on COCO, without additional hyperparameter tuning.

Evaluation and Novel Contributions in Relative Position Encoding for Vision Transformers

The paper "Rethinking and Improving Relative Position Encoding for Vision Transformer" examines how effective relative position encoding (RPE) is in vision transformers, a question that remains largely unexplored and contested even though RPE is well established in NLP. To close the gap between how widely position encodings are used in visual tasks and how well the trade-offs between relative and absolute encoding are understood, the authors systematically review existing methods and propose new variants tailored to vision transformers.

The study proceeds in stages: it reviews RPE methods from NLP and evaluates their applicability to vision transformers, analyzes their potential issues, and introduces new image-specific RPE methods. The proposed methods model directional relative distances and the interactions between queries and relative position embeddings in self-attention. This reevaluation matters because image data carries two-dimensional spatial dependencies that one-dimensional text does not.
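
As a rough illustration of what an interaction between queries and relative position embeddings can look like, consider the attention logit below. The notation is an assumption introduced here for illustration: $x_i$ is the $i$-th input token, $W^Q$ and $W^K$ are the query and key projections, $d$ is the head dimension, and $r_{ij}$ is a learned vector encoding the position of token $j$ relative to token $i$.

$$
e_{ij} = \frac{(x_i W^Q)(x_j W^K)^\top + b_{ij}}{\sqrt{d}}, \qquad b_{ij} = (x_i W^Q)\, r_{ij}^\top
$$

In this form the positional term $b_{ij}$ depends on the query content. A simpler, query-independent variant would use a learned scalar $b_{ij} = r_{ij}$ instead.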

Key Contributions and Methodologies

Analytical Synthesis of RPE: The authors analyze several prior RPE implementations that were designed mainly for 1D text inputs and adapt them to 2D image data. Shaw's RPE, Transformer-XL's adaptation, and other variants are examined to identify their pros and cons in vision models.
Proposal of Image RPE (iRPE): The authors introduce lightweight RPE methods designed explicitly for 2D image data. These methods rest on directional relative distance modeling and on the interaction between queries and relative position embeddings in self-attention. iRPE remains simple and efficient, can be plugged into transformer blocks, and improves accuracy.
Empirical Verification: With the proposed iRPE methods, DeiT gains up to 1.5% top-1 accuracy on ImageNet and DETR gains up to 1.3% mean Average Precision (mAP) on COCO over their original versions, without tuning extra hyperparameters such as learning rate and weight decay.
Efficient Computational Implementation: The paper introduces an efficient indexing mechanism that reduces computational complexity from $\mathcal{O}(n^2d)$ to $\mathcal{O}(nkd)$, where $k \ll n$. This is particularly relevant for the high-resolution image inputs common in object detection.
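
The complexity reduction in the last item can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that $k$ is the number of distinct relative-position buckets, that the positional term for a token pair is the dot product of a query with the embedding of that pair's bucket, and that buckets come from a clipped 1D distance. All names (`bias_naive`, `bias_indexed`, `bucket`) are hypothetical.

```python
import random

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def bias_naive(queries, rel_emb, bucket):
    """One d-dimensional dot product per token pair: O(n^2 * d)."""
    n = len(queries)
    return [[dot(queries[i], rel_emb[bucket[i][j]]) for j in range(n)]
            for i in range(n)]

def bias_indexed(queries, rel_emb, bucket):
    """Dot products once per (query, bucket): O(n * k * d), then lookups."""
    n = len(queries)
    # table[i][t] = q_i . r_t for each of the k buckets
    table = [[dot(q, r) for r in rel_emb] for q in queries]
    # Each pair now needs only a scalar read from the table.
    return [[table[i][bucket[i][j]] for j in range(n)] for i in range(n)]

random.seed(0)
n, k, d = 6, 3, 4  # tokens, buckets, embedding dimension (toy sizes)
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
rel_emb = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
# Assumed bucketing for illustration: clipped absolute 1D distance.
bucket = [[min(abs(i - j), k - 1) for j in range(n)] for i in range(n)]

naive = bias_naive(queries, rel_emb, bucket)
indexed = bias_indexed(queries, rel_emb, bucket)
```

Both functions return the same $n \times n$ matrix. The indexed version still performs $n^2$ scalar lookups, but the $d$-dimensional dot products are computed only $nk$ times, which is where the saving comes from when $k \ll n$.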

Experimental Insights

The experiments indicate that relative position encoding can replace absolute encoding in image classification, whereas absolute encoding remains important for object detection, which depends on accurate spatial localization. Furthermore, the directed encoding methods ('Cross' and 'Product') yield better results than undirected alternatives, highlighting the importance of directional information in images.
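
To make "directed" concrete, the sketch below shows one plausible way to index signed 2D offsets. It rests on assumptions not stated in the text above: that 'Product' uses a single joint index over the horizontal and vertical offsets, that 'Cross' uses two separate per-axis indices whose encodings are combined, and that offsets are simply clipped to $[-\beta, \beta]$ (a stand-in for whatever index function the paper uses). The function names are hypothetical.

```python
def clip_index(delta, beta):
    """Map a signed offset to an index in [0, 2*beta], preserving its sign."""
    return max(-beta, min(beta, delta)) + beta

def product_bucket(pos_i, pos_j, beta):
    """Joint bucket: one id per signed (dx, dy) pair, (2*beta + 1)**2 ids."""
    (xi, yi), (xj, yj) = pos_i, pos_j
    ix = clip_index(xi - xj, beta)
    iy = clip_index(yi - yj, beta)
    return ix * (2 * beta + 1) + iy

def cross_buckets(pos_i, pos_j, beta):
    """Per-axis buckets: two ids, each indexing a table of 2*beta + 1 entries."""
    (xi, yi), (xj, yj) = pos_i, pos_j
    return clip_index(xi - xj, beta), clip_index(yi - yj, beta)

def grid_positions(height, width):
    """(x, y) coordinates of image patches in row-major order."""
    return [(x, y) for y in range(height) for x in range(width)]

positions = grid_positions(3, 3)
beta = 1
a, b = positions[0], positions[5]  # patches at (0, 0) and (2, 1)

forward = product_bucket(a, b, beta)   # offset seen from a
backward = product_bucket(b, a, beta)  # offset seen from b
```

Because the offsets keep their sign, `forward` and `backward` land in different buckets, so the encoding can distinguish "left of" from "right of" and "above" from "below". An undirected scheme based on distance alone would assign both pairs the same bucket.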

Implications and Future Directions

The gains from iRPE suggest room for further research into position encoding mechanisms designed for specific vision tasks. In particular, the results argue for continued study of how to balance absolute and relative encoding according to task requirements.

Future research could extend the proposed framework to attention-based models beyond vision tasks, testing whether iRPE transfers to other data modalities as transformer-based architectures become more widespread. Reducing the complexity of the current methods further while preserving encoding precision could also make them more practical on resource-constrained platforms.

In conclusion, the paper clarifies how positional encoding behaves in vision transformers, offers practical, empirically validated methods, and motivates further work on adapting transformer architectures to specific applications.
