Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
98 tokens/sec
GPT-4o
8 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models (2311.17095v4)

Published 28 Nov 2023 in cs.CV and cs.AI

Abstract: From image-text pairs, large-scale vision-LLMs (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+26.2% mIoU on Pascal VOC, +20.5% mIoU on MS COCO, +3.1% mIoU on COCO Stuff and +3.0% mIoU on ADE20K). Our codebase is at https://github.com/letitiabanana/PnP-OVSS.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (89)
  1. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
  2. Single-stage semantic segmentation from image labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4253–4262, 2020.
  3. Self-taught object localization with deep networks. In 2016 IEEE winter conference on applications of computer vision (WACV), pages 1–9. IEEE, 2016.
  4. Zero-shot semantic segmentation. Advances in Neural Information Processing Systems, 32, 2019.
  5. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018.
  6. Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11165–11174, 2023.
  7. Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 699–710, 2023.
  8. Self-supervised image-specific prototype exploration for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4288–4298, 2022a.
  9. Uniter: Universal image-text representation learning. In Proceedings of the 16th European Conference on Computer Vision, 2020.
  10. Class re-activation maps for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 969–978, 2022b.
  11. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1201–1210, 2015.
  12. Weakly supervised object localization with multi-fold multiple instance learning. IEEE transactions on pattern analysis and machine intelligence, 39(1):189–203, 2016.
  13. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.
  14. Weakly supervised semantic segmentation by pixel-to-prototype contrast. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4320–4329, 2022.
  15. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
  16. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022.
  17. Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1921–1929, 2020.
  18. From images to textual prompts: Zero-shot visual question answering with frozen large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10867–10877, 2023.
  19. A review of semantic segmentation using deep neural networks. International journal of multimedia information retrieval, 7:87–93, 2018.
  20. Global knowledge calibration for fast open-vocabulary segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 797–807, 2023.
  21. A brief survey on semantic segmentation with deep learning. Neurocomputing, 406:302–321, 2020.
  22. Self-erasing network for integral object attention. Advances in Neural Information Processing Systems, 31, 2018.
  23. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021.
  24. L2g: A simple local-to-global knowledge transfer framework for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16886–16896, 2022.
  25. Universal weakly supervised segmentation by pixel-to-segment contrastive learning. arXiv preprint arXiv:2105.00957, 2021.
  26. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021.
  27. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 695–711. Springer, 2016.
  28. Efficient inference in fully connected crfs with gaussian edge potentials. Advances in neural information processing systems, 24, 2011.
  29. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3524–3533, 2017.
  30. Weakly supervised semantic segmentation using out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16897–16906, 2022.
  31. Language-driven semantic segmentation, 2022a.
  32. mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022b.
  33. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022c.
  34. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021a.
  35. Expansion and shrinkage of localization for weakly-supervised semantic segmentation. Advances in Neural Information Processing Systems, 35:16037–16051, 2022d.
  36. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022e.
  37. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
  38. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021b.
  39. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV 2020, 2020.
  40. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
  41. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  42. Open-world semantic segmentation via contrasting and clustering vision-language embedding. In European Conference on Computer Vision, pages 275–292. Springer, 2022.
  43. Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In International Conference on Machine Learning, pages 23033–23044. PMLR, 2023.
  44. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 891–898, 2014.
  45. Open vocabulary semantic segmentation with patch aligned contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19413–19423, 2023.
  46. A closer look at self-training for zero-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2693–2702, 2021.
  47. Freeseg: Unified, universal and open-vocabulary image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19446–19455, 2023.
  48. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021a.
  49. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021b.
  50. Perceptual grouping in contrastive vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5571–5584, 2023.
  51. Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency. arXiv preprint arXiv:2302.10307, 2023.
  52. Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16846–16855, 2022.
  53. Token contrast for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3093–3102, 2023a.
  54. Token contrast for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3093–3102, 2023b.
  55. Unsupervised joint object discovery and segmentation in internet images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1939–1946, 2013.
  56. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017.
  57. Reco: Retrieve and co-segment for zero-shot transfer. Advances in Neural Information Processing Systems, 35:33754–33767, 2022.
  58. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279, 2021.
  59. Simon Blanke. Gradient-Free-Optimizers: Simple and reliable optimization with local, global, population-based and sequential techniques in numerical search spaces. https://github.com/SimonBlanke, since 2020.
  60. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15638–15650, 2022.
  61. On regularized losses for weakly-supervised cnn segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 507–522, 2018.
  62. Plug-and-play vqa: Zero-shot vqa by conjoining large pretrained models with zero training. In Findings of the Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP), 2022.
  63. Unsupervised image matching and object discovery as optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8287–8296, 2019.
  64. Large-scale unsupervised object discovery. Advances in Neural Information Processing Systems, 34:16764–16778, 2021.
  65. Looking beyond single images for weakly supervised semantic segmentation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022a.
  66. Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3124–3134, 2023.
  67. Self-supervised transformers for unsupervised object discovery using normalized cut. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14543–14553, 2022b.
  68. Unsupervised object discovery and co-localization by deep descriptor transformation. Pattern Recognition, 88:113–126, 2019.
  69. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1568–1576, 2017.
  70. A simple baseline for knowledge-based visual question answering. In EMNLP, 2023.
  71. Semantic projection network for zero-and few-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8256–8265, 2019.
  72. mplug-2: A modularized multi-modal foundation model across text, image and video. arXiv preprint arXiv:2302.00402, 2023a.
  73. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134–18144, 2022a.
  74. Learning open-vocabulary semantic segmentation models from natural language supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2935–2944, 2023b.
  75. Multi-class token transformer for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4310–4319, 2022b.
  76. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In European Conference on Computer Vision, pages 736–753. Springer, 2022c.
  77. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023c.
  78. Bridgetower: Building bridges between encoders in vision-language representation learning. arXiv preprint arXiv:2206.08657, 2022d.
  79. CoCa: Contrastive captioners are image-text foundation models. arXiv Preprint 2205.01917, 2022.
  80. Open-vocabulary semantic segmentation using test-time distillation. In European Conference on Computer Vision, pages 56–72. Springer, 2022.
  81. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021.
  82. X22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT-vlm: All-in-one pre-trained model for vision-language tasks. arXiv preprint arXiv:2211.12402, 2022.
  83. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12765–12772, 2020.
  84. A simple framework for open-vocabulary segmentation and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1020–1031, 2023.
  85. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5579–5588, 2021.
  86. Adversarial complementary learning for weakly supervised object localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1325–1334, 2018.
  87. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.
  88. Extract free dense labels from clip. In European Conference on Computer Vision, pages 696–712. Springer, 2022a.
  89. Regional semantic contrast and aggregation for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4299–4309, 2022b.
Citations (2)

Summary

We haven't generated a summary for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com