Agent Attention: On the Integration of Softmax and Linear Attention (2312.08874v3)
Abstract: The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as agents for the query tokens $Q$ to aggregate information from $K$ and $V$, and then broadcast the information back to $Q$. Since the number of agent tokens can be designed to be much smaller than the number of query tokens, agent attention is significantly more efficient than the widely adopted Softmax attention, while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Therefore, agent attention seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention shows remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, our agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention.
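To make the two-stage computation described above concrete, here is a minimal PyTorch sketch of the agent attention idea: the agent tokens first attend to $K$ and $V$ to aggregate global context, and the queries then attend to the small set of agents to receive it. This is an illustrative sketch, not the official implementation (see the linked repository); the function name `agent_attention` is hypothetical, and in the paper the agent tokens are typically derived from the queries (e.g., by pooling), whereas here they are simply passed in as an argument.

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, agent, scale=None):
    """Illustrative two-stage agent attention (hypothetical helper).

    q, k, v: (batch, n, d) query / key / value tokens
    agent:   (batch, m, d) agent tokens, with m much smaller than n
    """
    d = q.shape[-1]
    scale = scale if scale is not None else d ** -0.5
    # Stage 1 (agent aggregation): agents act on behalf of the queries
    # and gather global information from K and V via Softmax attention.
    agent_v = F.softmax(agent @ k.transpose(-2, -1) * scale, dim=-1) @ v   # (batch, m, d)
    # Stage 2 (agent broadcast): each query attends to the agents
    # to receive the aggregated information.
    out = F.softmax(q @ agent.transpose(-2, -1) * scale, dim=-1) @ agent_v  # (batch, n, d)
    return out

# Both stages cost O(n * m * d); with m fixed and m << n, the overall
# complexity is linear in the number of tokens n, unlike the O(n^2 * d)
# cost of standard Softmax attention.
x = torch.randn(1, 196, 64)   # hypothetical token sequence
a = torch.randn(1, 49, 64)    # hypothetical agent tokens
y = agent_attention(x, x, x, a)
print(y.shape)                # torch.Size([1, 196, 64])
```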