Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks

Published 5 May 2021 in cs.CV | (2105.02358v2)

Abstract: Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This paper proposes a novel attention mechanism which we call external attention, based on two external, small, learnable, shared memories, which can be implemented easily by simply using two cascaded linear layers and two normalization layers; it conveniently replaces self-attention in existing popular architectures. External attention has linear complexity and implicitly considers the correlations between all data samples. We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification. Extensive experiments on image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis reveal that our method provides results comparable or superior to the self-attention mechanism and some of its variants, with much lower computational and memory costs.

Authors (4)

Citations (408)

Summary

The paper introduces external attention as an efficient alternative to self-attention, leveraging shared learnable memory to reduce computational costs.
It employs two linear layers with normalization to capture dataset-level correlations, yielding robust results across visual classification and segmentation tasks.
Empirical results demonstrate efficiency gains and competitive performance on benchmarks such as ImageNet, COCO, and PASCAL VOC.

Overview of "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks"

The paper introduces a new attention mechanism for visual tasks, termed "external attention". It targets two limitations of self-attention: computational complexity that is quadratic in the number of positions, and the fact that affinities are computed only within a single sample, so potential correlations between different samples are ignored. The proposed alternative computes attention against small external memories shared across samples and has linear complexity.

Key Concept: External Attention

The core idea behind external attention is to use two small, learnable memory units as an external knowledge base shared across all samples. Unlike self-attention, which relies on affinities calculated between positions within a sample, external attention calculates the affinities between the input features and the external memory. Because the memory is small and fixed in size, the cost drops from quadratic to linear in the number of positions, and because the memory is shared across the whole dataset, the mechanism implicitly captures dataset-level correlations. The authors present it as a convenient replacement for self-attention in existing popular architectures.
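
The contrast described above can be written compactly. The notation here is assumed for illustration and is not taken from this summary: F is the N x d input feature map of one sample, Q, K and V are linear projections of F, M_k and M_v are the two S x d shared memories, and Norm stands for whatever normalization the method applies.

```latex
% Self-attention: affinities between all pairs of positions within one sample
\[
A_{\text{self}} = \operatorname{softmax}\!\left(Q K^{\top}\right) \in \mathbb{R}^{N \times N},
\qquad F_{\text{out}} = A_{\text{self}}\, V
\]

% External attention: affinities between positions and S shared memory slots
\[
A_{\text{ext}} = \operatorname{Norm}\!\left(F M_k^{\top}\right) \in \mathbb{R}^{N \times S},
\qquad F_{\text{out}} = A_{\text{ext}}\, M_v
\]
```

With S fixed and much smaller than N, the attention map grows linearly with N instead of quadratically.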

Implementation Details

External attention is implemented with two cascaded linear layers and two normalization layers, which makes it easy to integrate into existing deep learning models. The memory units can be thought of as capturing the most informative aspects of the dataset, so that attention focuses on salient features rather than noise. The paper also incorporates a multi-head mechanism into external attention, which yields an all-MLP architecture for image classification called external attention MLP (EAMLP).
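
A minimal sketch of the single-head computation follows, in plain Python so that it runs without extra libraries. It is an illustration under stated assumptions, not the authors' code: the function names, the shapes (N positions, d channels, S memory slots) and in particular the choice of normalization (softmax over positions, then L1 normalization over memory slots) are assumptions, since this summary only says that normalization is applied.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def double_normalize(scores):
    """Assumed normalization: softmax down each column (over positions),
    then L1-normalize each row (over memory slots)."""
    normalized_cols = []
    for col in zip(*scores):
        col_max = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - col_max) for v in col]
        total = sum(exps)
        normalized_cols.append([e / total for e in exps])
    rows = transpose(normalized_cols)
    return [[v / sum(row) for v in row] for row in rows]


def external_attention(features, m_k, m_v):
    """features: N x d, m_k: S x d, m_v: S x d. Returns an N x d feature map."""
    attn = double_normalize(matmul(features, transpose(m_k)))  # N x S
    return matmul(attn, m_v)  # N x d


if __name__ == "__main__":
    random.seed(0)
    n, d, s = 6, 4, 3  # hypothetical sizes for the demo

    def rand_matrix(rows, cols):
        return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    # The same memories are reused for every sample; in training they would be learned.
    m_k, m_v = rand_matrix(s, d), rand_matrix(s, d)
    for _ in range(2):
        sample = rand_matrix(n, d)
        out = external_attention(sample, m_k, m_v)
        print(len(out), len(out[0]))
```

Note that no N x N matrix is ever formed: the only intermediate is the N x S attention map, and the two memories play the role of the weights of two bias-free linear layers.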

Empirical Results

The paper presents extensive experimental validation across multiple domains, including image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis. Specifically:

ImageNet Classification: External attention was incorporated into transformer architectures, yielding competitive accuracy with reduced computational demands.
COCO Detection and Segmentation: The method is reported to improve accuracy over baseline detectors and segmentation models, suggesting its utility in tasks requiring fine-grained feature extraction.
Semantic Segmentation on PASCAL VOC and ADE20K: The approach is reported to match or outperform the compared methods, suggesting that it captures the spatial dependencies needed for pixel-level tasks.
Point Cloud Tasks: External attention was also applied to 3D data, where it is presented as a promising alternative to self-attention-based approaches to point cloud processing.

Computational Efficiency

External attention requires significantly fewer parameters and multiply-accumulate operations than self-attention and its variants, and its cost grows linearly rather than quadratically with the number of positions. Such efficiency gains suggest potential applications in resource-limited environments or scenarios demanding rapid model inference.
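
To make the scaling difference concrete, the following back-of-the-envelope sketch counts multiply-accumulate operations for the two attention steps only. It is a simplified estimate rather than the paper's accounting: projection layers and normalization are ignored, and the function names and example sizes (n positions, d channels, s memory slots) are hypothetical.

```python
def self_attention_macs(n, d):
    """MACs for the n x n affinity matrix plus the weighted sum of values."""
    return 2 * n * n * d


def external_attention_macs(n, d, s):
    """MACs for the n x s affinities to the memory plus the n x d read-out."""
    return 2 * n * s * d


if __name__ == "__main__":
    d, s = 512, 64  # hypothetical channel width and memory size
    for n in (1024, 4096, 16384):
        sa = self_attention_macs(n, d)
        ea = external_attention_macs(n, d, s)
        # Under this estimate the ratio is simply n / s.
        print(n, sa, ea, sa / ea)
```

Under this estimate, doubling the number of positions quadruples the self-attention count but only doubles the external attention count, so the gap widens as resolution grows.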

Practical and Theoretical Implications

In practical terms, the research offers a scalable alternative to self-attention for real-time applications and environments constrained by computational resources. Theoretically, the proposed mechanism could motivate further investigation into external memory architectures and their applications in machine learning tasks beyond computer vision.

Future Directions

Future developments may explore more sophisticated memory units, potential extensions to other domains (such as natural language processing), and hybrid attention models incorporating both internal and external elements. Moreover, understanding the theoretical underpinnings of how external attention compares to implicit memory architectures could lead to novel insights into learning dynamics in neural networks.

In conclusion, the paper shows that external attention addresses two of self-attention's main limitations, quadratic cost and the restriction to within-sample correlations, while maintaining competitive performance. This work provides a foundation for subsequent exploration in both theoretical and applied settings.
