MoDE: CLIP Data Experts via Clustering (2404.16030v1)

Published 24 Apr 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.


Summary

  • The paper presents a novel MoDE framework that clusters image-caption data to train specialized experts and mitigate noise in CLIP models.
  • The method employs semantically coherent clusters to enable asynchronous training of the experts, at less than 35% of the training cost of the larger ViT-L/14 baselines.
  • Experimental results show that MoDE outperforms larger models like ViT-L/14 in zero-shot image classification accuracy while enhancing scalability and efficiency.

Enhanced Training Efficiency in CLIP Models Through Mixture of Data Experts (MoDE)

Introduction

The paper presents a novel framework called Mixture of Data Experts (MoDE), which addresses challenges in training Contrastive Language-Image Pre-training (CLIP) models. CLIP training suffers from the noise inherent in web-crawled image-caption pairs, which degrades its effectiveness. MoDE mitigates this by training multiple data experts, each on a distinct, semantically coherent data cluster, making each expert more robust to false-negative samples from other clusters and improving training efficiency.

Approach

The core methodology of MoDE involves:

  • Clustering: Data is divided into fine-grained clusters, ensuring that each cluster maintains semantic coherence. This clustering is crucial as it allows each data expert to specialize, reducing sensitivity to noise in other data subsets.
  • Training of Data Experts: Each cluster is linked to a specific data expert model that trains solely on that cluster's data. This separation allows for focused and efficient learning.
  • Ensemble During Inference: At inference, the outputs of the experts are combined, with each expert weighted by the correlation between the task's metadata (e.g., class names) and that expert's cluster centers.

This structured approach not only tackles the noise issue but also streamlines the training process by allowing asynchronous training of data experts.
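As a rough sketch of how these pieces fit together, the snippet below uses scikit-learn k-means to form fine-grained clusters from caption embeddings, groups the fine centers into a handful of coarse conditions (one per data expert), and at inference weights the experts by the similarity between class-name embeddings and each expert's fine-grained centers. Everything here is illustrative: `caption_embeddings`, `class_name_embs`, and the `zero_shot_logits` method are placeholders, and the softmax routing is a simplification of the correlation-based weighting described in the paper; the released MetaCLIP code should be consulted for the actual implementation.

```python
# Minimal, illustrative sketch of MoDE-style clustering and ensembling.
import numpy as np
from sklearn.cluster import KMeans


def build_clusters(caption_embeddings, n_fine=1024, n_coarse=4, seed=0):
    """Two-step clustering: many fine-grained clusters capture semantics;
    a coarse clustering over the fine centers defines one data expert each."""
    fine = KMeans(n_clusters=n_fine, random_state=seed).fit(caption_embeddings)
    coarse = KMeans(n_clusters=n_coarse, random_state=seed).fit(fine.cluster_centers_)
    fine_to_coarse = coarse.labels_                 # fine cluster -> expert id
    pair_to_expert = fine_to_coarse[fine.labels_]   # training pair -> expert id
    return fine.cluster_centers_, fine_to_coarse, pair_to_expert


def ensemble_logits(image_emb, class_name_embs, experts, fine_centers,
                    fine_to_coarse, temperature=0.1):
    """Weight each expert by how strongly the task's class-name embeddings
    align with the fine-grained centers owned by that expert, then take a
    weighted average of the experts' zero-shot logits."""
    sim = class_name_embs @ fine_centers.T          # (n_classes, n_fine)
    weights = np.array([
        sim[:, fine_to_coarse == e].max(axis=1).mean()
        for e in range(len(experts))
    ])
    weights = np.exp(weights / temperature)
    weights /= weights.sum()
    # `zero_shot_logits` is a stand-in for "encode the image and class prompts
    # with this expert's weights and return cosine-similarity logits".
    return sum(w * expert.zero_shot_logits(image_emb, class_name_embs)
               for w, expert in zip(weights, experts))
```

In this setup each expert is a full CLIP model trained only on the pairs assigned to its coarse cluster, so the per-expert training loops can run completely independently of one another.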

Experimental Results

The experimental evaluation of MoDE reveals several critical findings:

  • The MoDE framework, utilizing four CLIP data experts based on the ViT-B/16 architecture, outperforms the larger ViT-L/14 model used in OpenAI's CLIP and OpenCLIP in terms of zero-shot image classification accuracy.
  • This performance advantage is achieved at significantly lower cost: less than 35% of the training compute of the baseline models.
  • The MoDE framework is also flexible: new data experts can be added without retraining the existing system, as the sketch after this list illustrates.
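Because routing depends only on the stored cluster centers and each expert is trained independently, growing the ensemble is conceptually just an append operation. The snippet below is a hypothetical illustration of that property, reusing the placeholder structures from the earlier sketch; it is not part of the released code.

```python
# Hypothetical illustration: add a new data expert without touching the
# existing ones. The new expert is trained on its own cluster, and its
# fine-grained centers are appended so the routing picks it up.
def add_expert(experts, fine_centers, fine_to_coarse,
               new_expert, new_fine_centers):
    new_label = len(experts)                        # next expert id
    experts = experts + [new_expert]                # trained asynchronously
    fine_centers = np.vstack([fine_centers, new_fine_centers])
    fine_to_coarse = np.concatenate(
        [fine_to_coarse, np.full(len(new_fine_centers), new_label)])
    return experts, fine_centers, fine_to_coarse
```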

Implications and Future Work

The MoDE approach significantly enhances the practicality and scalability of CLIP models by addressing key limitations around training efficiency and noise sensitivity. From a theoretical standpoint, the use of semantically coherent clusters and expert-based training could influence future designs of not only image-caption models but broader multimodal architectures.

Speculatively, the framework could be adapted for generative models, potentially offering a pathway to more efficient and scalable generative systems. Such developments could be critical as the demand for sophisticated, resource-efficient AI systems continues to grow.

Conclusion

MoDE represents a strategic evolution in the training of CLIP models, emphasizing efficiency, scalability, and robustness. By effectively utilizing a cluster-based, expert-driven training methodology, it sets a foundation for future advancements in both the practical deployment and theoretical development of generative and discriminative multimodal systems. Moreover, the asynchronous training capability and the potential for future expansion make MoDE an adaptable solution suited to the dynamic nature of AI research and application challenges.