Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation (2112.09445v3)
Abstract: Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision, providing finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use the InfoNCE loss to train a model to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. This inefficiency can be partially attributed to the fact that the image-text pairs are noisy. To address this, we propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning. Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with the InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10,032 classes) from Tencent ML-Images. Across 42 evaluations (7 dataset/architecture settings × 6 metrics), OTTER outperforms all baselines in 32 and ties in 2, leading in 34 of 42.
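To make the mechanism concrete, below is a minimal PyTorch sketch of the idea the abstract describes: Sinkhorn-style entropic optimal transport (Cuturi, 2013) produces a soft matching between the images and captions in a batch, which is blended with the usual hard identity targets and used as distillation labels for a symmetric contrastive loss. The function names (`sinkhorn`, `otter_loss`) and hyperparameters (`eps`, `tau`, `alpha`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def sinkhorn(cost, eps=0.1, n_iters=50):
    # Entropic optimal transport via Sinkhorn iterations: alternately
    # rescale rows and columns of the Gibbs kernel until the plan's
    # marginals approach the uniform distributions r and c.
    K = torch.exp(-cost / eps)
    r = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    c = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    v = torch.ones_like(c)
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return u[:, None] * K * v[None, :]  # transport plan, shape (B, B)


def otter_loss(img_emb, txt_emb, tau=0.07, alpha=0.5):
    # Contrastive loss with OT-distilled soft targets (sketch).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau

    with torch.no_grad():
        cost = 1.0 - img_emb @ txt_emb.t()           # cosine distance
        plan = sinkhorn(cost)
        soft = plan / plan.sum(dim=1, keepdim=True)  # rows sum to 1
        eye = torch.eye(logits.size(0), device=logits.device)
        targets = alpha * eye + (1.0 - alpha) * soft  # blend hard + soft

    # Symmetric image-to-text and text-to-image cross-entropy
    # against the soft targets, as in CLIP's two-sided loss.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

For brevity the sketch reuses the online embeddings to build the transport plan; setting `alpha = 1` recovers the plain InfoNCE targets, while smaller values let noisy, partially matching pairs in the batch contribute soft supervision.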
- Label-embedding for attribute-based classification. CVPR, 2013.
- Evaluation of output embeddings for fine-grained image classification. CVPR, 2015.
- Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371, 2019.
- Label refinery: Improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641, 2018.
- A large annotated corpus for learning natural language inference. EMNLP, 2015.
- Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
- Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
- Learning the best pooling strategy for visual semantic embedding. CVPR, 2021.
- Graph optimal transport for cross-domain alignment. ICML, 2020.
- Wasserstein contrastive representation distillation. CVPR, 2021.
- A simple framework for contrastive learning of visual representations. ICML, 2020.
- Uniter: Universal image-text representation learning. ECCV, 2020.
- Optimal transport for domain adaptation. arXiv preprint arXiv:1507.00504v2, 2016.
- Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013.
- Fbnetv3: Joint architecture-recipe search using neural acquisition function. arXiv preprint arXiv:2006.02049, 2020.
- An entropic optimal transport loss for learning deep neural networks under label noise in remote sensing images. arXiv preprint arXiv:1810.01163, 2018.
- Imagenet: A large-scale hierarchical image database. CVPR, pp. 248–255, 2009.
- Virtex: Learning visual representations from textual annotations. arXiv preprint arXiv:2006.06666, 2020.
- wordnet: WordNet Interface, 2020. URL https://CRAN.R-project.org/package=wordnet. R package version 0.1-15.
- Devise: A deep visual-semantic embedding model. NIPS, 2013.
- Declutr: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659, 2020.
- Dimensionality reduction by learning an invariant mapping. CVPR, 2006.
- Deep residual learning for image recognition. CVPR, 2016.
- Momentum contrast for unsupervised visual representation learning. CVPR, 2020.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Image-to-word transformation based on dividing and vector quantizing images with words. pp. 405–409, 1999.
- Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2020.
- Learning visual features from large weakly supervised data. In ECCV, 2016.
- Rethinking knowledge graph propagation for zero-shot learning. CVPR, 2019.
- Deep fragment embeddings for bidirectional image sentence mapping. NIPS, pp. 1889–1897, 2014.
- Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 2020.
- 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
- The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
- Learning visual n-grams from web data. ICCV, pp. 4193–4202, 2017. doi: 10.1109/ICCV.2017.449.
- Visual semantic reasoning for image-text matching. In ICCV, 2019.
- Hyperbolic visual embedding learning for zero-shot recognition. CVPR, 2020.
- Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480, 2021.
- S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4969–4983, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.447. URL https://www.aclweb.org/anthology/2020.acl-main.447.
- Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
- Fine-grained visual classification of aircraft. Technical report, 2013.
- Zero-shot learning by convex combination of semantic embeddings. ICLR, 2014.
- Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
- Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, 2020.
- Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- Video event understanding using natural language descriptions. ICCV, pp. 905–912, 2013. doi: 10.1109/ICCV.2013.117.
- Sentence-bert: Sentence embeddings using siamese bert-networks. EMNLP, 2019. URL https://arxiv.org/abs/1908.10084.
- Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592, 2020.
- An embarrassingly simple approach to zero-shot learning. ICML, 2015.
- Improving gans using optimal transport. ICLR, 2018.
- Learning visual representations with caption annotations. arXiv preprint arXiv:2008.01392, 2020.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
- Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218, 2014. doi: 10.1162/tacl_a_00177. URL https://www.aclweb.org/anthology/Q14-1017.
- Learning from noisy labels with deep neural networks: A survey. arXiv preprint arXiv:2007.08199, 2020.
- Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. arXiv preprint arXiv:2103.01913, 2021.
- Rethinking the inception architecture for computer vision. CVPR, pp. 2818–2826, 2016.
- Yfcc100m: The new data in multimedia research. Commun. ACM, 59(2):64–73, 2016. ISSN 0001-0782. doi: 10.1145/2812802. URL https://doi.org/10.1145/2812802.
- Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
- Fbnetv2: Differentiable neural architecture search for spatial and channel dimensions. CVPR, pp. 12965–12974, 2020.
- Learning models for object recognition from natural language descriptions. BMVC, 2009.
- Zero-shot recognition via semantic embeddings and knowledge graphs. CVPR, 2018.
- Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
- Tencent ml-images: A large-scale multi-label image database for visual representation learning. IEEE Access, 7, 2019.
- Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. CVPR, 2019.
- Self-training with noisy student improves imagenet classification. CVPR, 2020.
- Contrastive learning of medical visual representations from paired images and texts. arXiv preprint arXiv:2010.00747, 2020.