Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively (2401.02955v2)
Abstract: CLIP and the Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels at segmentation tasks across diverse domains, whereas CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP's knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors show the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, significantly outperforming the naïve baselines of simply combining SAM and CLIP. Furthermore, aided by training on image-classification data, our method can segment and recognize approximately 22,000 classes.
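The naïve baseline the abstract compares against — running SAM to get masks, then classifying each mask crop with CLIP — reduces at recognition time to CLIP-style zero-shot matching between a region embedding and text embeddings of class-name prompts. A minimal NumPy sketch with toy deterministic embeddings; the SAM and CLIP encoders themselves are not shown, and `zero_shot_classify` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # CLIP compares embeddings by cosine similarity, i.e. dot products
    # of L2-normalized vectors.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(region_embedding, text_embeddings, class_names,
                       temperature=100.0):
    """Zero-shot recognition of one segmented region, CLIP-style.

    region_embedding: (D,) image embedding of the SAM mask crop.
    text_embeddings:  (C, D) embeddings of class-name prompts.
    Returns the predicted class name and the softmax scores over classes.
    """
    img = l2_normalize(region_embedding)
    txt = l2_normalize(text_embeddings)
    logits = temperature * (txt @ img)        # scaled cosine similarities
    scores = np.exp(logits - logits.max())    # numerically stable softmax
    scores /= scores.sum()
    return class_names[int(np.argmax(scores))], scores

# Toy 4-d embeddings: the region embedding is close to the "cat" prompt.
region = np.array([1.0, 0.0, 0.0, 0.0])
text = np.array([[0.9, 0.1, 0.0, 0.0],   # "a photo of a cat"
                 [0.0, 1.0, 0.0, 0.0]])  # "a photo of a dog"
label, scores = zero_shot_classify(region, text, ["cat", "dog"])
```

Because the two encoders are trained separately and never see each other's features, this pipeline also pays for a full CLIP forward pass per mask crop — precisely the inefficiency and recognition gap that the SAM2CLIP/CLIP2SAM transfer modules are designed to close within a single unified model.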