Segment Everything Everywhere All at Once (2304.06718v4)
Abstract: In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig. 1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries, including points, boxes, scribbles, and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of the two prompt types required for various segmentation tasks; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from the decoder to image features; and iv) Semantic-awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. Notably, our single SEEM model achieves competitive performance across interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with as little as 1/100 of the supervision. Furthermore, SEEM showcases a remarkable capacity for generalization to novel prompts or their combinations, rendering it a readily universal image segmentation interface.
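To make the mask-guided cross-attention mentioned in the abstract more concrete, here is a minimal sketch in plain Python. It is not SEEM's implementation: the function name `masked_cross_attention`, the list-based tensors, and the fallback for an empty mask are all assumptions made for illustration. It shows only the general idea that each query (for example, a memory prompt) attends to the image-feature locations its mask allows.

```python
import math


def masked_cross_attention(queries, keys, values, mask):
    """Attend from each query to the image locations allowed by its mask.

    queries: list of n query vectors, each of length d
    keys, values: lists of m image-feature vectors
    mask: n x m list of booleans; True means the location may be attended to
    (Hypothetical helper for illustration, not part of any SEEM code base.)
    """
    d = len(queries[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        # Assumed convention: an empty mask falls back to attending everywhere.
        if not any(allowed):
            allowed = [True] * len(keys)
        # Scaled dot-product scores; disallowed locations get -inf.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        # Numerically stable softmax over the allowed locations.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy usage: one query whose mask allows only the first of three locations.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
result = masked_cross_attention(queries, keys, values, [[True, False, False]])
```

With the mask restricted to a single location, the output for that query is simply the value vector at that location; widening the mask blends in features from the other allowed locations.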