SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference (2312.01597v4)
Abstract: Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong zero-shot classification capabilities by aligning visual representations with target text embeddings at the image level. In dense prediction tasks, however, CLIP often struggles to localize visual features within an image and fails to produce accurate pixel-level predictions, which prevents it from serving as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module and reuse its pretrained query, key, and value projection matrices, yielding a training-free adaptation approach for CLIP's zero-shot semantic segmentation. Extensive experiments demonstrate the advantage of CSA: we obtain an average zero-shot mIoU of 38.2% across the eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing state of the art (33.9%) and vanilla CLIP (14.1%).
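The abstract names the CSA module without spelling out its computation, so the following is a minimal, hedged sketch of how such a training-free attention swap could look in PyTorch. It assumes a single attention head and that the correlative scores are formed by correlating each projection with itself (q qᵀ plus k kᵀ) rather than the usual q kᵀ; the function name, tensor shapes, and the exact score formula are illustrative assumptions, not the paper's verbatim implementation.

```python
import torch
import torch.nn.functional as F

def correlative_self_attention(x, w_q, w_k, w_v, w_out, scale):
    """Minimal single-head sketch of a CSA-style attention block.

    x             : (N, D) patch tokens entering the last encoder layer
    w_q, w_k, w_v : (D, D) frozen projection weights reused from
                    CLIP's pretrained attention block
    w_out         : (D, D) frozen output projection
    scale         : softmax temperature, e.g. D ** -0.5

    Assumption: pairwise scores come from each projection correlated
    with itself (q @ q.T and k @ k.T) instead of q @ k.T, so attention
    concentrates on mutually similar tokens, which preserves spatial
    locality for dense prediction.
    """
    q = x @ w_q  # reuse pretrained query projection
    k = x @ w_k  # reuse pretrained key projection
    v = x @ w_v  # reuse pretrained value projection
    # Correlative scores: sum of the two self-correlation maps.
    attn = F.softmax(scale * (q @ q.T), dim=-1) \
         + F.softmax(scale * (k @ k.T), dim=-1)
    return (attn @ v) @ w_out


# Hypothetical usage: 196 patch tokens of width 768 with random
# stand-in weights (in practice these come from the frozen CLIP model).
if __name__ == "__main__":
    N, D = 196, 768
    x = torch.randn(N, D)
    w = [torch.randn(D, D) for _ in range(4)]
    out = correlative_self_attention(x, *w, scale=D ** -0.5)
    print(out.shape)  # torch.Size([196, 768])
```

Because every weight is reused as-is and only the score computation changes, the adaptation requires no gradient updates, matching the training-free claim in the abstract.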