Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies (2404.08197v2)
Abstract: This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when it is scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset of lower quality. We also examine how model performance varies with dataset size, finding that smaller ViT models are better suited to smaller datasets, while larger models perform better on larger datasets at a fixed compute budget. Additionally, we provide guidance on when to choose a CNN-based or a ViT-based architecture for CLIP training. We compare four CLIP training strategies (SLIP, FLIP, CLIP, and CLIP+Data Augmentation) and show that the choice of training strategy depends on the available compute resources. Our analysis reveals that CLIP+Data Augmentation can achieve performance comparable to CLIP while using only half of the training data. This work provides practical insights into how to train and deploy CLIP models effectively, making them more accessible and affordable for use in a variety of applications.
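To make the training-strategy comparison concrete, below is a minimal sketch (not the authors' code) of the symmetric contrastive objective shared by CLIP-style methods, together with the kind of image augmentation pipeline that a CLIP+Data Augmentation variant adds on the vision side. The loss follows the standard published CLIP formulation; the specific transform choices (RandomResizedCrop, RandomHorizontalFlip, RandAugment) are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of CLIP-style contrastive training pieces.
# Assumes PyTorch and torchvision >= 0.11 (for RandAugment).
import torch
import torch.nn.functional as F
from torchvision import transforms

# Hypothetical augmentation pipeline for the "CLIP + Data Augmentation" setting:
# standard crop/flip plus a stronger RandAugment policy applied to training images.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(),   # extra augmentation beyond plain CLIP preprocessing
    transforms.ToTensor(),
])

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_embeds, text_embeds: (N, D) batch of paired embeddings.
    logit_scale: learned temperature (a scalar tensor in CLIP).
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = logit_scale * image_embeds @ text_embeds.t()          # (N, N)
    labels = torch.arange(logits.size(0), device=logits.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

In this sketch, swapping `train_transform` between plain resizing/cropping and the augmented pipeline is the only difference between the CLIP and CLIP+Data Augmentation settings; the loss is unchanged.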
- Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Neural Information Processing Systems, 2019.
- Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Uniter: Universal image-text representation learning. In European Conference on Computer Vision, 2020.
- Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022.
- Randaugment: Practical automated data augmentation with a reduced search space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3008–3017, 2020.
- Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, 2009.
- Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- No one representation to rule them all: Overlapping features of training methods. arXiv preprint arXiv:2110.12899, 2021.
- Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
- Natural adversarial examples. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15257–15266, 2021.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8320–8329, 2021.
- Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708, 2017.
- OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 2021.
- Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
- Scaling language-image pre-training via masking. arXiv preprint arXiv:2212.00794, 2022.
- Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
- A convnet for the 2020s. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11966–11976, 2022.
- Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. arXiv preprint arXiv:2107.04649, 2021.
- SLIP: Self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750, 2021.
- Quality not quantity: On the interaction between dataset design and robustness of CLIP. arXiv preprint arXiv:2208.05516, 2022.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
- Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050, 2021.
- Improving language understanding by generative pre-training. OpenAI Blog, 2018.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp. 5389–5400, 2019.
- Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Measuring robustness to natural distribution shifts in image classification. arXiv preprint arXiv:2007.00644, 2020.
- Mlp-mixer: An all-mlp architecture for vision. In Neural Information Processing Systems, 2021.
- Learning robust global representations by penalizing local predictive power. In Neural Information Processing Systems, 2019.
- LiT: Zero-shot transfer with locked-image text tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18102–18112, 2022.