Parrot Captions Teach CLIP to Spot Text (2312.14232v3)
Abstract: Despite serving as the foundation model for numerous vision-language applications, CLIP suffers from a severe text spotting bias: it tends to 'parrot' the visual text embedded within images while disregarding the authentic visual semantics. We find that in the most popular image-text dataset, LAION-2B, the captions likewise densely parrot (spell out) the text embedded in images. Our analysis shows that around 50% of the images contain embedded visual text, and around 30% of caption words appear in that embedded text. Based on this observation, we thoroughly inspect different released versions of CLIP models and verify that visual text is the dominant factor in these models' LAION-style image-text similarity scores. To examine whether such parrot captions shape the text spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions readily induces this bias but harms the expected vision-language representation learning in CLIP models. This suggests an urgent need to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP-score filtering.
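A minimal sketch (not the authors' code) of the two quantities the abstract describes: (1) the fraction of caption words that are "parroted" as visual text inside the image, and (2) a LAION-style CLIP image-text similarity score. Assumptions: the OCR words are taken to come from some external text-spotting model (not shown), the OpenCLIP checkpoint tag `laion2b_s34b_b79k` is just one illustrative choice, and the example caption/OCR data are hypothetical.

```python
import re
import torch
import open_clip
from PIL import Image


def caption_parrot_ratio(caption: str, ocr_words: list[str]) -> float:
    """Fraction of caption words that also appear as text embedded in the image."""
    caption_tokens = re.findall(r"[a-z0-9]+", caption.lower())
    if not caption_tokens:
        return 0.0
    ocr_vocab = {w.lower() for w in ocr_words}
    hits = sum(1 for tok in caption_tokens if tok in ocr_vocab)
    return hits / len(caption_tokens)


@torch.no_grad()
def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (LAION-style score)."""
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"  # assumed checkpoint tag
    )
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([caption])
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()


if __name__ == "__main__":
    # Hypothetical example: a product photo whose caption repeats the printed text.
    caption = "Acme Super Glue 20g tube"
    ocr_words = ["ACME", "SUPER", "GLUE", "20g"]  # e.g. output of a text spotter
    print(f"parrot ratio: {caption_parrot_ratio(caption, ocr_words):.2f}")
    # print(f"CLIP score: {clip_similarity('example.jpg', caption):.3f}")
```

A high parrot ratio combined with a high CLIP score on the original image (but a much lower score once text regions are masked or inpainted) is the pattern the paper associates with the text spotting bias.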