Fine-Grained Visual Prompting (2306.04356v2)
Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance on instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often yield sub-optimal performance because they include irrelevant and noisy pixels. In this paper, we carefully study visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, is exceptionally effective. This prompting strategy leverages precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP.
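A minimal sketch of the Blur Reverse Mask prompt described above may help make the idea concrete: blur the image everywhere except inside the target mask, then let a CLIP-style scorer rank the prompted images against the referring expression. This is an illustrative reimplementation, not the released FGVP code; the mask is assumed to come from any generalist segmentation model (e.g., SAM) as a boolean array, and the Gaussian radius and the `clip_scores` helper in the usage comment are hypothetical.

```python
# Illustrative sketch of the Blur Reverse Mask visual prompt (assumptions noted above).
import numpy as np
from PIL import Image, ImageFilter


def blur_reverse_mask(image: Image.Image, mask: np.ndarray, radius: float = 10.0) -> Image.Image:
    """Keep the target region sharp and blur everything outside its mask.

    `mask` is a boolean HxW array (e.g., one candidate mask from a segmentation model);
    `radius` is an illustrative blur strength, not a value taken from the paper.
    """
    blurred = image.filter(ImageFilter.GaussianBlur(radius=radius))
    mask_img = Image.fromarray(mask.astype(np.uint8) * 255, mode="L").resize(image.size)
    # Where the mask is white, keep the original pixels; elsewhere, use the blurred ones.
    return Image.composite(image, blurred, mask_img)


# Hypothetical zero-shot usage: prompt the image once per candidate mask and let a
# CLIP-style scorer (clip_scores is a placeholder) pick the mask that best matches
# the referring expression.
# prompted = [blur_reverse_mask(img, m) for m in candidate_masks]
# best = int(np.argmax(clip_scores(prompted, "the dog on the left")))
```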