VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders (2309.01141v4)
Abstract: Large-scale text-to-image diffusion models have shown impressive capabilities on generative tasks by leveraging strong vision-language alignment learned during pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.
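The abstract's region-scoring idea can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: it assumes each proposal is scored by the expected denoising error of a diffusion model conditioned on the referring expression (lower error meaning better image-text alignment, as in diffusion-classifier approaches), and that global (masked full image) and local (cropped proposal) contexts are combined by averaging. The functions `score_proposal`, `comprehensive_score`, and `ground` are hypothetical names, and the "denoising error" here is a stand-in distance rather than a real U-Net's noise-prediction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_proposal(proposal_feat, text_feat, n_samples=8):
    """Toy stand-in for a diffusion-based alignment score.

    A real implementation would noise the proposal's latent and measure
    ||eps - eps_theta(z_t, t, text)||^2 with a pretrained U-Net; here we
    approximate the expected error with a feature distance plus Monte
    Carlo noise, so better-aligned proposals get higher (less negative)
    scores.
    """
    errs = []
    for _ in range(n_samples):
        noise = rng.normal(size=proposal_feat.shape)
        errs.append(np.mean(noise ** 2) + np.sum((proposal_feat - text_feat) ** 2))
    return -float(np.mean(errs))  # negate: higher score = better match

def comprehensive_score(crop_feat, mask_feat, text_feat):
    # Combine local context (cropped proposal) and global context
    # (proposal masked within the full image); averaging the two scores
    # is an assumption for this sketch.
    return 0.5 * (score_proposal(crop_feat, text_feat)
                  + score_proposal(mask_feat, text_feat))

def ground(proposal_feats, text_feat):
    """Return the index of the proposal best matching the expression."""
    scores = [score_proposal(p, text_feat) for p in proposal_feats]
    return int(np.argmax(scores))
```

In a real pipeline, each of the `N` proposals would be cropped and masked, encoded into the diffusion model's latent space, and scored under the text condition, with the highest-scoring region returned as the grounding result.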