Modeling Caption Diversity in Contrastive Vision-Language Pretraining (2405.00740v4)
Abstract: There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks, even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% across zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate a 6.0% improvement in zero-shot retrieval on MS-COCO. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.
- Masked Siamese networks for label-efficient learning. In European Conference on Computer Vision, pp. 456–473. Springer, 2022.
- Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15619–15629, 2023.
- Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022.
- Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.
- Food-101 – Mining Discriminative Components with Random Forests. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision – ECCV 2014, Lecture Notes in Computer Science, pp. 446–461, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10599-4. 10.1007/978-3-319-10599-4_29.
- Signature verification using a "Siamese" time delay neural network. Advances in Neural Information Processing Systems, 6, 1993.
- A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.
- UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pp. 104–120. Springer, 2020.
- Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proceedings of the IEEE, 105(10):1865–1883, October 2017. ISSN 0018-9219, 1558-2256. 10.1109/JPROC.2017.2675998.
- Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829, 2023.
- Describing Textures in the Wild, November 2013.
- An Analysis of Single-Layer Networks in Unsupervised Feature Learning, 2010.
- Vision transformers need registers, 2023.
- Hyperbolic image-text representations, 2024.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone, November 2022.
- Data filtering networks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KAk6ngZ09F.
- Improved baselines for vision-language pre-training. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=a7nvXxNmdV. Featured Certification.
- Foucault, M. Les mots et les choses (The Order of Things). Gallimard, Paris, 1990.
- PyramidCLIP: Hierarchical feature alignment for vision-language model pretraining, 2022.
- Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361, 2012. 10.1109/CVPR.2012.6248074.
- EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification, February 2019.
- On feature decorrelation in self-supervised learning. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9578–9588, Los Alamitos, CA, USA, October 2021. IEEE Computer Society. 10.1109/ICCV48922.2021.00946. URL https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00946.
- OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
- Discovering states and transformations in image collections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- Scaling up visual and vision-language representation learning with noisy text supervision, 2021.
- Understanding dimensional collapse in contrastive self-supervised learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=YevsQ05DEN7.
- Adam: A method for stochastic optimization, 2017.
- Collecting a Large-Scale Dataset of Fine-Grained Cars, 2013.
- Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images, 2009.
- LeCun, Y. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27, 2022.
- MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.
- The Caltech-UCSD Birds-200-2011 Dataset. https://authors.library.caltech.edu/records/cvm3y-5hh21, 2011.
- Align before fuse: Vision and language representation learning with momentum distillation, 2021.
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, February 2022.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597, 2023.
- Grounded language-image pre-training, 2022.
- UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409, 2020.
- CLIPA-v2: Scaling CLIP training with 81.1% zero-shot ImageNet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy, 2023.
- Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23390–23400, 2023.
- Microsoft coco: Common objects in context. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision – ECCV 2014, pp. 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1.
- Improved baselines with visual instruction tuning, 2023.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Fine-Grained Visual Classification of Aircraft, June 2013.
- Self-supervised learning of pretext-invariant representations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6706–6716, Los Alamitos, CA, USA, June 2020. IEEE Computer Society. 10.1109/CVPR42600.2020.00674. URL https://doi.ieeecomputersociety.org/10.1109/CVPR42600.2020.00674.
- Text-to-concept (and back) via cross-model alignment, 2023.
- AnyMAL: An efficient and scalable any-modality augmented language model, 2023.
- SLIP: Self-supervision meets language-image pre-training, 2021.
- Automated Flower Classification over a Large Number of Classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729, Bhubaneswar, India, December 2008. IEEE. 10.1109/ICVGIP.2008.47.
- DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3498–3505, June 2012. 10.1109/CVPR.2012.6248092.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
- Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. CoRR, abs/2007.13916, 2020. URL https://arxiv.org/abs/2007.13916.
- Learning Transferable Visual Models From Natural Language Supervision, February 2021.
- Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.
- Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.
- ImageNet Large Scale Visual Recognition Challenge, January 2015.
- UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. URL http://arxiv.org/abs/1212.0402.
- The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IEEE International Joint Conference on Neural Networks, pp. 1453–1460, 2011.
- EVA-CLIP: Improved training techniques for CLIP at scale, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, 2022.
- SimVLM: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=GUrhfTuf_3.
- SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492, June 2010. 10.1109/CVPR.2010.5539970.
- Demystifying CLIP Data, October 2023.
- Weakly supervised lesion localization with probabilistic-CAM pooling. arXiv, abs/2005.14480, 2020. URL https://api.semanticscholar.org/CorpusID:215776849.
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. 10.1162/tacl_a_00166. URL https://aclanthology.org/Q14-1006.
- CoCa: Contrastive captioners are image-text foundation models, 2022.
- LiT: Zero-shot transfer with locked-image text tuning, 2022.
- Sigmoid Loss for Language Image Pre-Training, September 2023.
- VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5579–5588, June 2021.
- Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16816–16825, June 2022.