Vision-by-Language for Training-Free Compositional Image Retrieval (2310.09291v2)

Published 13 Oct 2023 in cs.CV

Abstract: Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly triplet annotations (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art ZS-CIR approaches still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image with a pre-trained generative VLM and asking an LLM to recompose the caption according to the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive, in part state-of-the-art, performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without retraining, allowing us to investigate scaling laws and bottlenecks for ZS-CIR while, in parts, more than doubling previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text modularly in the language domain, making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
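The abstract describes a three-stage, training-free pipeline: caption the reference image with a generative VLM, have an LLM rewrite the caption according to the textual modification, then rank gallery images by text-to-image similarity with CLIP. The sketch below illustrates that idea only; it is not the authors' released implementation. The specific models (a BLIP captioner, an OpenAI chat model as the LLM, an OpenAI-pretrained ViT-L/14 CLIP via open_clip) and the prompt wording are illustrative assumptions.

```python
# Minimal sketch of a CIReVL-style training-free CIR pipeline.
# Assumptions (not taken from the paper's code): BLIP for captioning,
# an OpenAI chat model for caption recomposition, and CLIP (ViT-L/14)
# via open_clip for retrieval. Model names and the prompt are placeholders.
import torch
from PIL import Image
import open_clip
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Caption the reference image with a pre-trained generative VLM.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large").to(device)

def caption_image(image: Image.Image) -> str:
    inputs = blip_proc(image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# 2) Ask an LLM to recompose the caption according to the modification text.
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def recompose(caption: str, modification: str) -> str:
    prompt = (
        "Rewrite the image description so it reflects the requested edit.\n"
        f"Description: {caption}\n"
        f"Edit: {modification}\n"
        "Rewritten description:"
    )
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any instruction-tuned LLM could be used
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# 3) Retrieve: rank gallery images by CLIP similarity to the rewritten caption.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
clip_model = clip_model.to(device).eval()
clip_tokenizer = open_clip.get_tokenizer("ViT-L-14")

@torch.no_grad()
def retrieve(query_text: str, gallery: list, top_k: int = 5):
    text = clip_tokenizer([query_text]).to(device)
    text_feat = clip_model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    imgs = torch.stack([clip_preprocess(im) for im in gallery]).to(device)
    img_feats = clip_model.encode_image(imgs)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)

    sims = (img_feats @ text_feat.T).squeeze(-1)
    return sims.topk(min(top_k, len(gallery))).indices.tolist()

# Example usage (hypothetical files, mirroring the abstract's example):
# reference = Image.open("eiffel_tower.jpg").convert("RGB")
# query = recompose(caption_image(reference), "without people and at night-time")
# ranked_indices = retrieve(query, gallery_images)
```

Because every stage is an off-the-shelf, frozen model composed purely in the language domain, the intermediate caption and its rewritten form can be inspected or manually corrected, which is what the abstract refers to as making the pipeline human-understandable and intervenable.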
