Compositional Kronecker Context Optimization for Vision-Language Models (2403.11631v1)
Abstract: Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning a compact context that retains satisfactory base-to-new, domain, and cross-task generalization while adapting to new tasks remains a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors crafted by linearly combining base vectors sourced from a dictionary. Each base vector consists of a non-learnable component, obtained by quantizing the weights of the token embedding layer, and a learnable component, constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting to the training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product relaxes the non-learnable restriction of the dictionary, enhancing its representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also does so with fewer learnable parameters and efficient training and inference speed.
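To make the construction concrete, below is a minimal PyTorch sketch of the context described in the abstract: learnable coefficients linearly combine dictionary base vectors, where each base vector is a frozen component derived from the pre-trained token embedding plus a learnable Kronecker-product correction. This is not the authors' code; the class name `CKContext`, the shapes (`n_ctx`, `dict_size`, the Kronecker factor sizes), and the use of row sub-sampling as a stand-in for the paper's quantization step are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class CKContext(nn.Module):
    """Sketch of compositional Kronecker context vectors (assumed structure).

    Context words = coefficients @ (frozen dictionary + Kronecker correction).
    """

    def __init__(self, token_embedding_weight, n_ctx=4, dict_size=64,
                 kron_a=(8, 16), kron_b=(8, 32)):
        super().__init__()
        dim = token_embedding_weight.shape[1]            # embedding dim, e.g. 512
        assert kron_a[0] * kron_b[0] == dict_size
        assert kron_a[1] * kron_b[1] == dim

        # Non-learnable component: derive dict_size base vectors from the
        # pre-trained token embedding. The paper quantizes these weights; here
        # we simply sub-sample rows as a hedged placeholder for that step.
        with torch.no_grad():
            idx = torch.randperm(token_embedding_weight.shape[0])[:dict_size]
            frozen_basis = token_embedding_weight[idx].clone()
        self.register_buffer("frozen_basis", frozen_basis)   # (dict_size, dim)

        # Learnable component: the Kronecker product of two tiny matrices
        # yields a (dict_size, dim) correction with very few parameters.
        self.A = nn.Parameter(0.01 * torch.randn(*kron_a))   # e.g. (8, 16)
        self.B = nn.Parameter(0.01 * torch.randn(*kron_b))   # e.g. (8, 32)

        # Linear-combination coefficients for the n_ctx context words.
        self.coeff = nn.Parameter(torch.randn(n_ctx, dict_size) / dict_size)

    def forward(self):
        basis = self.frozen_basis + torch.kron(self.A, self.B)  # (dict_size, dim)
        return self.coeff @ basis                                # (n_ctx, dim)
```

In a CoOp-style pipeline, the returned (n_ctx, dim) context would be prepended to the class-name token embeddings before the frozen text encoder; only the coefficients and the tiny Kronecker factors are trained, which is what keeps the learnable parameter count small.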