Consistency-guided Prompt Learning for Vision-Language Models (2306.01195v4)
Abstract: We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models. Our approach improves the generalization of large foundation models when fine-tuned on downstream tasks in a few-shot setting. The basic idea of CoPrompt is to enforce a consistency constraint between the predictions of the trainable and pre-trained models to prevent overfitting on the downstream task. Additionally, we introduce two components into our consistency constraint to further boost performance: enforcing consistency on two perturbed inputs and combining the two dominant tuning paradigms, prompting and adapters. Enforcing consistency on perturbed inputs further regularizes the consistency constraint, thereby improving generalization. Moreover, the integration of adapters and prompts not only enhances performance on downstream tasks but also offers increased tuning flexibility in both the input and output spaces, which facilitates more effective adaptation to downstream tasks in a few-shot setting. Experiments show that CoPrompt outperforms existing methods on a range of evaluation suites, including base-to-novel generalization, domain generalization, and cross-dataset evaluation. On generalization, CoPrompt improves the state of the art on zero-shot tasks and the overall harmonic mean over 11 datasets. Detailed ablation studies show the effectiveness of each component of CoPrompt. Our code is available at https://github.com/ShuvenduRoy/CoPrompt.
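To make the objective concrete, below is a minimal PyTorch sketch of the kind of training loss the abstract describes: a supervised few-shot term plus a consistency term between a frozen pre-trained encoder and a trainable prompt-plus-adapter encoder, computed on two perturbed views of the same input. The names (`TunableEncoder`, `coprompt_style_loss`), the cosine-distance form of the consistency term, and the toy encoders are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a consistency-guided fine-tuning loss, assuming a
# cosine-distance consistency term; all names here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TunableEncoder(nn.Module):
    """Stand-in for a prompt-tuned encoder followed by a lightweight adapter."""

    def __init__(self, backbone: nn.Module, dim: int = 512):
        super().__init__()
        self.backbone = backbone  # prompt-tuned encoder (trainable prompts)
        self.adapter = nn.Sequential(  # small adapter on the output features
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim)
        )

    def forward(self, x):
        f = self.backbone(x)
        return f + self.adapter(f)  # residual adapter on top of the backbone features


def coprompt_style_loss(frozen_enc, tunable_enc, view_a, view_b, logits, labels, lam=1.0):
    """Supervised few-shot loss plus consistency between the frozen pre-trained
    encoder and the trainable (prompt + adapter) encoder on two perturbed views."""
    with torch.no_grad():
        z_frozen = frozen_enc(view_a)  # pre-trained model sees the first view
    z_tuned = tunable_enc(view_b)      # trainable model sees the second view
    # Cosine-distance consistency (one reasonable choice; the exact form is an assumption).
    consistency = 1.0 - F.cosine_similarity(z_tuned, z_frozen, dim=-1).mean()
    supervised = F.cross_entropy(logits, labels)  # standard few-shot classification loss
    return supervised + lam * consistency


if __name__ == "__main__":
    # Toy stand-ins for the image encoders and two noise-perturbed views of a batch.
    frozen = nn.Linear(3 * 32 * 32, 512)
    tunable = TunableEncoder(nn.Linear(3 * 32 * 32, 512))
    x = torch.randn(8, 3 * 32 * 32)
    view_a = x + 0.1 * torch.randn_like(x)
    view_b = x + 0.1 * torch.randn_like(x)
    logits, labels = torch.randn(8, 5), torch.randint(0, 5, (8,))
    loss = coprompt_style_loss(frozen, tunable, view_a, view_b, logits, labels)
    loss.backward()
```

In practice the two views would come from data augmentation rather than additive noise, and the consistency weight `lam` would be tuned per task; this sketch only illustrates how the consistency term regularizes the trainable branch toward the frozen pre-trained model.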