Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation (2404.04231v1)
Abstract: This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: a text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.
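To make the region-word alignment idea concrete, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss over matched region and word-segment features. This is an illustrative approximation, not the paper's exact objective: the function name, feature shapes, and temperature value are assumptions, and the region/word features are assumed to be pre-computed by the vision-language encoders.

```python
# Hedged sketch: region-word contrastive alignment (not the authors' exact loss).
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(region_feats, word_feats, temperature=0.07):
    """Symmetric InfoNCE over matched region / word-segment pairs.

    region_feats: (N, D) features of N image regions
    word_feats:   (N, D) features of the N word segments paired with them
    """
    # L2-normalize so dot products become cosine similarities.
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    # (N, N) similarity matrix; the diagonal holds the matched pairs.
    logits = region_feats @ word_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: region -> word and word -> region.
    loss_r2w = F.cross_entropy(logits, targets)
    loss_w2r = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_r2w + loss_w2r)

# Toy usage with random tensors standing in for encoder outputs.
regions = torch.randn(8, 512)
words = torch.randn(8, 512)
print(region_word_contrastive_loss(regions, words).item())
```

In the actual framework, the paired image and text would first be decomposed into regions and word segments (with the prompt-derived highlighting representation guiding feature extraction for each segment) before such an alignment loss is applied; the sketch above only covers the final alignment step.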