
Compositional Kronecker Context Optimization for Vision-Language Models (2403.11631v1)

Published 18 Mar 2024 in cs.CV

Abstract: Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning a compact context with satisfactory base-to-new, domain, and cross-task generalization ability while adapting to new tasks remains a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying the Kronecker product to several tiny learnable matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product relaxes the non-learnable restriction of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also requires fewer learnable parameters and offers efficient training and inference speed.
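The construction described in the abstract can be sketched numerically. The following is a minimal illustration, not the paper's implementation: all shapes, the toy quantization, and the random stand-ins for token-embedding weights are assumptions chosen for clarity. It shows how a Kronecker product of two tiny matrices yields a full-size learnable component with very few parameters, and how context vectors arise as linear combinations of dictionary base vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (not from the paper): embedding dim d = 16,
# dictionary size K = 8, context length M = 4.
d, K, M = 16, 8, 4

# Non-learnable component: quantized token-embedding weights.
# Real token embeddings are replaced here by random stand-ins,
# quantized to a coarse grid as a toy quantizer.
token_embed = rng.standard_normal((K, d))
frozen = np.round(token_embed * 2) / 2

# Learnable component: Kronecker product of two tiny matrices.
# kron((2,4), (4,4)) -> (8,16), i.e. K x d, using only
# 2*4 + 4*4 = 24 parameters instead of K*d = 128.
A = rng.standard_normal((2, 4))
B = rng.standard_normal((4, 4))
learnable = np.kron(A, B)  # shape (K, d)

# Dictionary of base vectors: frozen part plus learnable correction.
dictionary = frozen + learnable

# Context words: learnable linear combinations of the base vectors.
coeff = rng.standard_normal((M, K))
context = coeff @ dictionary  # (M, d) prompt context vectors

print(context.shape)
```

In training, only `A`, `B`, and `coeff` would receive gradients, which is why the parameter count stays small relative to learning `M * d` free context vectors directly.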

