
Neighborhood Attention Transformer (2204.07143v5)

Published 14 Apr 2022 in cs.CV, cs.AI, and cs.LG

Abstract: We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation, localizing self attention (SA) to the nearest neighboring pixels, and therefore enjoys a linear time and space complexity compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K, which is 1.9% ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU improvement over a Swin model with similar size. To support more research based on sliding-window attention, we open source our project and release our checkpoints at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer .

Citations (197)

Summary

  • The paper presents Neighborhood Attention, a novel mechanism that localizes attention to reduce computational complexity from quadratic to linear.
  • It introduces NATTEN, a Python package with efficient C++/CUDA implementations achieving up to a 40% speed increase and 25% memory reduction versus Swin Transformer.
  • The Neighborhood Attention Transformer outperforms comparable models in ImageNet accuracy and object detection benchmarks by efficiently expanding the receptive field.

Neighborhood Attention Transformer: Enhancing Computational Efficiency and Performance in Vision Transformers

The paper introduces Neighborhood Attention (NA), a novel attention mechanism designed to overcome the computational cost of standard Self Attention (SA) in vision transformers. NA localizes each pixel's attention to its nearest neighboring pixels, reducing time and space complexity from quadratic to linear in the number of pixels. This addresses the inefficiency of global attention on the high-resolution images common in vision tasks such as object detection and segmentation.
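The core idea can be illustrated with a toy 1-D NumPy sketch (an illustrative simplification, not the paper's tiled CUDA implementation): each position attends only to its k nearest neighbors, with the window clamped at the borders so every position sees exactly k keys, mirroring NA's corner handling.

```python
import numpy as np

def neighborhood_attention_1d(x, Wq, Wk, Wv, k=3):
    """Toy 1-D neighborhood attention: each position attends only to
    its k nearest neighbors; the window is clamped at the borders so
    every position sees exactly k keys (no padding, no shrinking)."""
    n, d = x.shape
    q, key, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(v)
    r = k // 2
    for i in range(n):
        # Clamp the window into [0, n - k] so it never runs off either end.
        start = min(max(i - r, 0), n - k)
        nbr = slice(start, start + k)
        scores = q[i] @ key[nbr].T / np.sqrt(d)   # k scores, not n
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ v[nbr]
    return out

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
y = neighborhood_attention_1d(x, Wq, Wk, Wv, k=3)
print(y.shape)  # (8, 4)
```

A useful sanity check, noted in the paper: when the kernel size reaches the full sequence length, neighborhood attention reduces to ordinary self attention.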

Central to the paper is the Neighborhood Attention Extension (NATTEN), a Python package with efficient C++ and CUDA implementations that markedly accelerate NA's computational performance, achieving up to a 40% speed increase and a 25% reduction in memory usage compared to Swin Transformer's Window Self Attention (WSA). The implementation leverages the tiled NA algorithm, which maximizes parallel processing capabilities, thereby optimizing resource allocation on GPUs.
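A back-of-envelope calculation shows where the memory savings come from. The feature-map size and kernel size below are illustrative choices (a 56x56 map and the 7x7 kernel used in NAT), not figures from NATTEN's benchmarks:

```python
# Attention-weight count per head for one 56x56 feature map.
n = 56 * 56          # number of pixels
k = 7 * 7            # NA kernel size 7x7, as used in NAT
sa_weights = n * n   # self attention: one score per pixel pair
na_weights = n * k   # neighborhood attention: k scores per pixel
print(sa_weights // na_weights)  # 64x fewer attention weights
```

Because the per-pixel cost is a constant k rather than n, the gap widens quadratically as resolution grows, which is why NA scales to dense prediction tasks where global SA does not.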

The authors propose the Neighborhood Attention Transformer (NAT), a hierarchical model built on the NA mechanism. NAT performs strongly across key vision benchmarks: NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet (a 1.9% improvement over a similarly sized Swin model), 51.4% mAP on MS-COCO (+1.0%), and 48.4% mIoU on ADE20K (+2.6%).

The research highlights the importance of translational equivariance, which the NA pattern preserves, offering an alternative to more rigid window-based approaches. Unlike Swin's non-overlapping windows, Neighborhood Attention expands its receptive field without supplementary operations such as pixel shifts. This enables more efficient processing and underpins the throughput and memory improvements.

The implications of this research extend to multiple domains within computer vision. By challenging the assumption that window-based methods are inherently superior due to perceived efficiency, this work opens opportunities for further explorations into localized attention mechanisms that can rival or even surpass current state-of-the-art models. Future directions may include enhancing these approaches for real-time applications or further optimizing NATTEN to accommodate broader architectural frameworks and computational environments.

This paper significantly contributes to the ongoing development and refinement of transformer models in vision applications, presenting a new pathway for utilizing localized attention, which is both computationally efficient and scalable. The open-source release of NATTEN further encourages the research community to build upon this work, potentially leading to more widespread adoption and innovation in efficient attention mechanisms.
