Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation (2404.04231v1)
Abstract: This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: a text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), in which the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.
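The region-word alignment objective described in the abstract can be sketched as a symmetric contrastive (InfoNCE-style) loss over matched region and word-segment embeddings. This is a minimal illustration only, assuming pre-extracted, row-aligned embedding matrices; the function name, the temperature value, and the exact loss form are assumptions, not the paper's implementation.

```python
import numpy as np

def region_word_contrastive_loss(regions, words, temperature=0.07):
    """Symmetric contrastive loss over matched region/word embeddings.

    Hypothetical sketch of a region-word alignment objective: row i of
    `regions` is assumed to pair with row i of `words`.

    regions: (N, D) array of image-region embeddings.
    words:   (N, D) array of word-segment embeddings.
    """
    # L2-normalize so dot products are cosine similarities.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)

    logits = r @ w.T / temperature   # (N, N); matched pairs on the diagonal
    labels = np.arange(len(r))

    def cross_entropy(z, y):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the region-to-word and word-to-region directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each region embedding toward its paired word segment while pushing it away from the other word segments in the batch, which is the alignment behavior the framework relies on.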
Authors: Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang, Chun-Pei Chen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Yung-Yu Chuang, Yen-Yu Lin