Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior (2401.06799v1)
Abstract: Recent Vision-Language Pretrained (VLP) models have become the backbone for many downstream tasks, but they are typically utilized as frozen models without further learning. Prompt learning improves a pre-trained VLP model by adding a learnable context vector to the inputs of the text encoder. In a few-shot learning scenario for a downstream task, MLE training can lead the context vector to overfit dominant image features in the training data. This overfitting can harm generalization, especially in the presence of a distribution shift between the training and test datasets. This paper presents a Bayesian framework for prompt learning that alleviates the overfitting issue in few-shot learning and increases the adaptability of prompts to unseen instances. Specifically, modeling a data-dependent prior enhances the adaptability of text features to both seen and unseen image features without a performance trade-off between them. Within this Bayesian framework, we estimate the target posterior distribution via Wasserstein Gradient Flow, which allows our prompt to flexibly capture the complex modes of image features. We demonstrate the effectiveness of our method on benchmark datasets across several experiments, showing statistically significant performance improvements over existing methods. The code is available at https://github.com/youngjae-cho/APP.
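The authors' implementation is linked above; the snippet below is only a minimal sketch of the kind of particle-based posterior update the abstract alludes to, namely Stein variational gradient descent (SVGD), a common particle discretization of the Wasserstein gradient flow of the KL divergence. The PyTorch framing, the particle count, the prompt dimensions, and the placeholder `log_posterior` are illustrative assumptions, not the paper's actual method or code.

```python
# Illustrative SVGD sketch: maintain several prompt context-vector "particles"
# instead of a single MLE prompt, and push them toward a target posterior.
import torch

def rbf_kernel(x, h=None):
    """RBF kernel matrix over particles plus the SVGD repulsive term."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)          # diff[i, j] = x_i - x_j, shape (n, n, d)
    sq_dist = (diff ** 2).sum(-1)                   # (n, n)
    if h is None:                                   # median-heuristic bandwidth
        h = sq_dist.median() / torch.log(torch.tensor(float(x.shape[0])) + 1.0)
        h = h.clamp(min=1e-8)
    k = torch.exp(-sq_dist / (2.0 * h))             # k[i, j] = k(x_i, x_j)
    # Repulsive term: sum_j d k(x_j, x_i) / d x_j = sum_j (x_i - x_j) / h * k[i, j]
    grad_k = (diff / h * k.unsqueeze(-1)).sum(1)    # (n, d)
    return k, grad_k

def svgd_step(particles, log_posterior, step_size=1e-2):
    """One SVGD update pushing prompt-context particles toward the posterior."""
    x = particles.detach().requires_grad_(True)     # (n_particles, d)
    score = torch.autograd.grad(log_posterior(x).sum(), x)[0]  # grad log p(theta | D)
    k, grad_k = rbf_kernel(x.detach())
    phi = (k @ score + grad_k) / x.shape[0]         # Stein variational direction
    return (x + step_size * phi).detach()

# Toy usage: particles drift toward a standard-normal "posterior" over a flattened
# 4-token, 512-dim context. In the paper's setting, the log-posterior would instead
# combine the CLIP few-shot likelihood with a data-dependent prior.
if __name__ == "__main__":
    torch.manual_seed(0)
    prompts = torch.randn(8, 4 * 512)               # 8 hypothetical prompt particles
    log_post = lambda t: -0.5 * (t ** 2).sum(-1)    # placeholder log-density
    for _ in range(100):
        prompts = svgd_step(prompts, log_post)
```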
Authors: Youngjae Cho, HeeSun Bae, Seungjae Shin, Yeo Dong Youn, Weonyoung Joo, Il-Chul Moon