CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models (2402.15021v2)
Abstract: Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
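The word-order insensitivity the abstract describes can be probed directly with an off-the-shelf contrastive VLM. The sketch below is not the paper's code; it simply scores one image against two captions that reuse the same words in swapped roles, using the Hugging Face `transformers` CLIP API. The image path `example.jpg` and the caption pair are illustrative placeholders.

```python
# Minimal sketch (assumed setup, not the CLoVe implementation): probe whether a
# contrastive VLM distinguishes two captions that differ only in word order.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # e.g., a photo of a dog chasing a cat
captions = [
    "a dog chasing a cat",  # composition matching the image
    "a cat chasing a dog",  # same words, roles swapped
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; a model that treats text
# as a bag of words assigns nearly identical scores to both captions.
scores = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, scores.squeeze().tolist())))
```

A model with stronger compositional language encoding should assign a clearly higher score to the caption whose word order matches the depicted relation.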
Authors: Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea