
Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation (2404.04231v1)

Published 5 Apr 2024 in cs.CV

Abstract: This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.

Authors (9)
  1. Ji-Jia Wu (2 papers)
  2. Andy Chia-Hao Chang (1 paper)
  3. Chieh-Yu Chuang (1 paper)
  4. Chun-Pei Chen (1 paper)
  5. Yu-Lun Liu (35 papers)
  6. Min-Hung Chen (41 papers)
  7. Hou-Ning Hu (9 papers)
  8. Yung-Yu Chuang (16 papers)
  9. Yen-Yu Lin (38 papers)
Citations (5)

Summary

  • The paper proposes the CoDe framework, aligning image regions with word segments to bridge the gap between training and testing in semantic segmentation.
  • It employs contrastive learning and learnable prompts to effectively map decomposed image and text components, enhancing feature extraction.
  • Experimental results on six benchmarks show superior performance, particularly in zero-shot scenarios, validating the framework's innovative approach.

Towards Accurate Text-Supervised Semantic Segmentation via Image-Text Co-Decomposition

Introduction

Semantic segmentation is a critical task enabling various applications in computer vision, such as autonomous driving, robotic navigation, and medical image analysis. Fully supervised methods have traditionally dominated the field, but they are hindered by the high cost of obtaining pixel-level annotations. In response, text-supervised semantic segmentation has emerged as an appealing alternative, leveraging the abundant image-text pairs available on the internet to overcome the annotation bottleneck. This work introduces a novel framework, CoDe (Co-Decomposition), that addresses a limitation inherent in existing text-supervised methods: by aligning image regions with word segments, it significantly improves semantic segmentation performance.

Methodology

Image-Text Co-Decomposition (CoDe) Framework

CoDe establishes a direct correspondence between image and text by decomposing them into semantically coherent regions and word segments, respectively. The decomposition employs a vision-language model to segment an image into regions and to split a text into word segments anchored on selected keywords (nouns); a minimal sketch of such noun selection follows this paragraph. This direct mapping between decomposed image regions and corresponding word segments reduces the training-testing discrepancy that affected prior methods.
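The paper selects nouns from the caption as anchors for word segments. Below is a minimal sketch of such keyword extraction using NLTK (which the paper references); the exact tokenization and filtering rules of CoDe may differ, so treat the function as illustrative.

```python
# Minimal sketch: select nouns from a caption as word-segment anchors.
import nltk

# Tokenizer and POS-tagger resources; package names vary across NLTK versions.
for pkg in ("punkt", "punkt_tab",
            "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(pkg, quiet=True)

def extract_nouns(caption: str) -> list[str]:
    """Return the nouns of a caption, used here as word-segment keywords."""
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)
    # NN, NNS, NNP, NNPS are the Penn Treebank noun tags.
    return [word for word, tag in tagged if tag.startswith("NN")]

print(extract_nouns("A dog chases a ball on the grass"))
# ['dog', 'ball', 'grass']
```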

  • Image and Text Segmenters: As the core components, these segmenters derive semantically coherent regions from the image and word segments from the text, the latter guided by the nouns selected from the caption.
  • Region-Word Alignment Module: Contrastive learning enforces alignment between matched image regions and word segments (see the sketch after this list), narrowing the gap between the training objective and the actual task of semantic segmentation that hampered earlier methods.
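
A minimal PyTorch sketch of the symmetric, InfoNCE-style region-word contrastive objective described above; the batch layout, temperature, and function name are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(region_emb: torch.Tensor,
                                 word_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over matched region/word-segment embeddings.

    region_emb, word_emb: (B, D) tensors where row i of both tensors comes
    from the same image-text pair; all other rows act as negatives.
    """
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    logits = region_emb @ word_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast regions against words and words against regions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```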

Highlighted Region and Word Prompts

Applying vision-language models to masked inputs is challenging, since highlighting a segment by masking out everything else produces out-of-distribution inputs. CoDe therefore introduces learnable prompts that preserve effective feature extraction: the prompts restore a consistent input structure for the vision-language model, thereby enhancing the alignment between the highlighted regions and words.
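
Below is a minimal sketch of how such a learnable highlighting prompt could replace masked-out patch embeddings before a frozen image encoder; the module name, shapes, and initialization are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HighlightPrompt(nn.Module):
    """Swap non-highlighted patch embeddings for a learnable prompt.

    Feeding literally blanked-out patches to a vision-language encoder is
    out of distribution; a learned prompt keeps the input structure intact.
    """
    def __init__(self, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(embed_dim) * 0.02)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) patch embeddings; mask: (B, N) with 1 = highlighted.
        mask = mask.unsqueeze(-1).to(patches.dtype)
        return patches * mask + self.prompt * (1.0 - mask)
```

The same idea would apply on the text side: tokens outside the word segment of interest are replaced by a learned text prompt before the text encoder.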

Experimental Validation

CoDe was evaluated extensively on six benchmark datasets, where it outperforms existing text-supervised semantic segmentation methods. Notably, CoDe performs favorably against state-of-the-art methods in zero-shot semantic segmentation (the inference protocol is sketched below), and comprehensive ablation studies confirm the contribution of each component as well as the overall approach.
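
For context, here is a schematic of the zero-shot inference protocol common to text-supervised segmentation: dense visual features are scored against text embeddings of the candidate class names, and each pixel takes the best-matching class. This is a generic sketch under those assumptions, not CoDe's exact pipeline.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_segment(pixel_feats: torch.Tensor,
                      class_text_emb: torch.Tensor) -> torch.Tensor:
    """Label each pixel with the class whose text embedding it matches best.

    pixel_feats:    (H, W, D) dense features from the image encoder.
    class_text_emb: (C, D) text embeddings, one per candidate class name.
    Returns an (H, W) integer label map.
    """
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    sims = pixel_feats @ class_text_emb.t()  # (H, W, C) cosine similarities
    return sims.argmax(dim=-1)
```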

Implications and Future Directions

The CoDe framework marks a significant step forward in text-supervised semantic segmentation, resolving critical discrepancies that have historically limited the efficacy of such methods. By establishing a robust mechanism for aligning image regions with word segments, CoDe not only enhances the segmentation accuracy but also opens up new possibilities for zero-shot and few-shot learning in the context of semantic segmentation. Future work could explore the integration of CoDe's principles with other tasks in vision-language processing, potentially leading to more generalized models capable of understanding and processing complex visual-textual relationships across a broader spectrum of applications.

Conclusion

The CoDe framework represents a notable advance in text-supervised semantic segmentation, addressing longstanding challenges through its co-decomposition methodology. By establishing a more accurate and direct alignment between image regions and textual descriptors, CoDe sets a new standard for efficient and robust text-supervised segmentation models and broadens the scope for research and applications in computer vision.