CAT: Cross Attention in Vision Transformer

Published 10 Jun 2021 in cs.CV and cs.AI | (2106.05786v1)

Abstract: Since the Transformer found widespread use in NLP, its potential in CV has been recognized and has inspired many new approaches. However, replacing word tokens with image patches after tokenizing the image (e.g., ViT) requires a vast amount of computation, which bottlenecks model training and inference. In this paper, we propose a new attention mechanism for the Transformer termed Cross Attention, which alternates attention within each image patch, rather than over the whole image, to capture local information, with attention between image patches divided from single-channel feature maps to capture global information. Both operations require less computation than standard self-attention in the Transformer. By alternately applying attention within patches and between patches, we implement cross attention to maintain performance at lower computational cost and build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone. The code and models are available at https://github.com/linhezheng19/CAT.

Summary

The paper introduces CAT, a cross attention mechanism that splits processing into Inner-Patch Self-Attention (IPSA) for local details and Cross-Patch Self-Attention (CPSA) for global context.
Its base model reaches 82.8% top-1 accuracy on ImageNet-1K while both attention operations cost less computation than standard self-attention.
The depth and width of each stage can be adjusted, so the design can trade accuracy against efficiency across vision tasks.

An Evaluation of the Cross Attention in Vision Transformer for Improved Image Processing

The paper introduces an attention mechanism named Cross Attention, designed to make Vision Transformers (ViTs) cheaper to run on computer vision tasks without giving up accuracy. Vision tasks have historically relied on Convolutional Neural Networks (CNNs) for feature extraction, owing to their proficiency in capturing local spatial hierarchies. With the emergence of the Transformer architecture, first popularized in NLP, Transformers have drawn attention in vision for their ability to capture global context. The obstacle has been computational cost: once an image is tokenized into patches, self-attention is applied across all tokens, and that cost grows quadratically with the number of tokens.
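To make the quadratic growth concrete, the following is a rough sketch that counts multiply-accumulate operations for one global self-attention layer. The counting convention (four square projections plus two token-by-token products, single head, no bias or softmax cost) and the channel width of 96 are assumptions chosen for illustration, not figures taken from the paper.

```python
def global_attention_macs(h: int, w: int, c: int) -> int:
    """Approximate multiply-accumulates for one single-head self-attention
    layer over h*w tokens of width c."""
    n = h * w                      # number of tokens
    projections = 4 * n * c * c    # Q, K, V and output projections
    attention = 2 * n * n * c      # Q @ K^T and weights @ V
    return projections + attention


# Hypothetical feature-map sizes; the width 96 is an illustrative assumption.
for side in (14, 28, 56):
    print(side, global_attention_macs(side, side, 96))
```

Doubling the side length quadruples the token count, so the attention term grows sixteenfold while the projection term grows only fourfold.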

The Cross Attention Transformer (CAT), as detailed in the paper, reduces this burden with a two-level attention mechanism: Inner-Patch Self-Attention (IPSA) and Cross-Patch Self-Attention (CPSA). IPSA captures local detail by applying self-attention only among the pixels inside each patch, so the attention cost grows quadratically with the number of pixels in a patch rather than with the number of pixels in the whole image. CPSA supplies the global context: each single-channel feature map is divided into patches and attention is applied between those patches, so the cost goes toward relations between patches rather than toward every pair of image tokens.
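The two operations differ mainly in how the feature map is partitioned before attention is applied. The NumPy sketch below illustrates that partitioning only; it is not the authors' implementation. It assumes identity Q/K/V projections, a single head, and no positional bias or normalization, and the function names `attend`, `ipsa`, and `cpsa` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(tokens):
    """Scaled dot-product self-attention with identity projections
    (an illustrative simplification). tokens: (..., num_tokens, dim)."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens


def ipsa(x, n):
    """Inner-patch attention: each pixel is a token of width C and attends
    only to pixels in the same n-by-n patch. x: (H, W, C), n divides H and W."""
    h, w, c = x.shape
    p = x.reshape(h // n, n, w // n, n, c).transpose(0, 2, 1, 3, 4)
    p = p.reshape(h // n, w // n, n * n, c)        # tokens grouped by patch
    out = attend(p)
    out = out.reshape(h // n, w // n, n, n, c).transpose(0, 2, 1, 3, 4)
    return out.reshape(h, w, c)


def cpsa(x, n):
    """Cross-patch attention: each channel is processed on its own; every
    n-by-n patch of that single-channel map is flattened into one token of
    length n*n, and the patches attend to each other."""
    h, w, c = x.shape
    p = x.reshape(h // n, n, w // n, n, c).transpose(4, 0, 2, 1, 3)
    p = p.reshape(c, (h // n) * (w // n), n * n)   # (C, num_patches, n*n)
    out = attend(p)
    out = out.reshape(c, h // n, w // n, n, n).transpose(1, 3, 2, 4, 0)
    return out.reshape(h, w, c)


x = np.random.default_rng(0).standard_normal((8, 8, 3))
print(ipsa(x, 4).shape, cpsa(x, 4).shape)
```

In this sketch, changing one pixel affects the IPSA output only inside that pixel's patch, whereas the CPSA output changes in every patch of the same channel, which is the local versus global split described above.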

The reported experiments indicate that this two-part approach maintains accuracy at lower computational cost and, in some cases, improves on existing results across standard datasets. On ImageNet-1K, for example, CAT achieves a top-1 accuracy of 82.8% in its base configuration, and it improves the performance of other methods when used for downstream vision tasks on the COCO and ADE20K datasets. These results point to CAT’s potential as a general backbone for varied vision applications, alternating between local, CNN-like feature extraction and global, Transformer-like feature extraction.

CAT can be customized by varying the depth and channel dimension of each stage, which lets it be sized to a given compute budget or accuracy target. In practice, this means the same architecture can be configured for different trade-offs between accuracy and computational expense.
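How depth and dimension feed into cost can be sketched with a small estimator. Everything concrete here is an assumption for illustration: the `Stage` fields, the two example configurations, the patch size of 7, the choice to count each layer as one IPSA plus one CPSA, and the operation counts themselves (derived for single-head attention with square projections). None of these values is taken from the paper's model definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    depth: int   # number of attention layers in the stage
    dim: int     # channel width C
    side: int    # feature-map height and width (H = W = side)


def ipsa_macs(side: int, dim: int, n: int) -> int:
    """Attention inside n-by-n patches: pixel tokens of width dim."""
    hw = side * side
    return 4 * hw * dim * dim + 2 * n * n * hw * dim


def cpsa_macs(side: int, dim: int, n: int) -> int:
    """Attention between patches, per channel: patch tokens of width n*n."""
    hw = side * side
    return 4 * n * n * hw * dim + 2 * hw * hw * dim // (n * n)


def model_macs(stages, n: int = 7) -> int:
    """Total estimate, counting one IPSA and one CPSA per layer (assumption)."""
    return sum(
        s.depth * (ipsa_macs(s.side, s.dim, n) + cpsa_macs(s.side, s.dim, n))
        for s in stages
    )


# Hypothetical configurations, not the paper's published variants.
narrow = [Stage(1, 64, 56), Stage(1, 128, 28), Stage(3, 256, 14), Stage(1, 512, 7)]
wide = [Stage(2, 96, 56), Stage(2, 192, 28), Stage(6, 384, 14), Stage(2, 768, 7)]

print(model_macs(narrow), model_macs(wide))
```

Under these assumptions, cost scales linearly with stage depth and faster than linearly with channel width, since the IPSA projection term is quadratic in `dim`.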

From a theoretical standpoint, the Cross Attention mechanism is a step towards combining local and global information processing in one block. The architecture offers a blueprint for future variants that draw on the strengths of both CNNs and Transformers without prohibitive resource demands. Future work could explore better adaptation to diverse and high-resolution image inputs, or further optimization of the attention mechanisms that strengthens the Transformer side without giving up the locality that makes CNNs effective.

In conclusion, CAT is a promising direction for Transformer-based computer vision and may influence how future vision backbones are structured. Beyond its reported results on established benchmarks, the Cross Attention mechanism offers a conceptual basis, alternating attention within patches and between patches, for future research and practical implementations.
