
Multi-modal Attribute Prompting for Vision-Language Models (2403.00219v3)

Published 1 Mar 2024 in cs.CV

Abstract: Pre-trained Vision-Language Models (VLMs), like CLIP, exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios. Existing prompting techniques primarily focus on global text and image representations while overlooking multi-modal attribute characteristics. This limitation hinders the model's ability to perceive fine-grained visual details and restricts its generalization ability to a broader range of unseen classes. To address this issue, we propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment. The proposed MAP enjoys several merits. First, we introduce learnable visual attribute prompts enhanced by textual attribute semantics to adaptively capture visual attributes for images from unknown categories, boosting fine-grained visual perception capabilities for CLIP. Second, the proposed attribute-level alignment complements the global alignment to enhance the robustness of cross-modal alignment for open-vocabulary objects. To our knowledge, this is the first work to establish cross-modal attribute-level alignment for CLIP-based few-shot adaptation. Extensive experimental results on 11 datasets demonstrate that our method performs favorably against state-of-the-art approaches.

Citations (3)

Summary

  • The paper introduces Multi-modal Attribute Prompting (MAP) that enhances fine-grained visual and textual alignment in few-shot scenarios.
  • It leverages both textual and visual attribute prompting along with an optimal transport-based alignment mechanism for precise cross-modal matching.
  • Experiments demonstrate superior performance in base-to-novel generalization and domain adaptation across multiple datasets compared to prior methods.

Multi-modal Attribute Prompting for Vision-Language Models

Introduction

The paper, "Multi-modal Attribute Prompting for Vision-LLMs" (2403.00219), addresses the adaptation challenges faced by large pre-trained Vision-LLMs (VLMs) such as CLIP in few-shot scenarios. These models, although proficient in generalization tasks, falter when exposed to limited data conditions due to their reliance on global representations and lack of detailed multi-modal attribute characterization. The authors propose a novel approach called Multi-modal Attribute Prompting (MAP), which integrates textual attribute prompting, visual attribute prompting, and a unique attribute-level alignment mechanism to enhance the model's fine-grained perceptual abilities and robustness in unseen class contexts. Figure 1

Figure 1: Conventional prompting methods versus multi-modal attribute exploration for fine-grained alignment.

Methodology

Textual Attribute Prompting

The methodology introduces a step called Textual Attribute Prompting, which utilizes LLMs to generate enriched semantic content by querying them for the discriminative features of each image class. The resulting text prompts pair class names with detailed attribute descriptions, creating a richer semantic context than traditional class-name-only prompts.
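As a rough illustration, the sketch below builds attribute-enriched prompts and encodes them with CLIP's text encoder. The prompt template and the hard-coded attribute lists (standing in for LLM output) are illustrative assumptions rather than the paper's exact wording, and the snippet assumes the OpenAI `clip` package is installed.

```python
# Minimal sketch of textual attribute prompting (illustrative, not the paper's exact templates).
# The attributes per class would come from querying an LLM; here they are hard-coded for clarity.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Hypothetical LLM-generated discriminative attributes for each class.
class_attributes = {
    "sparrow": ["a short conical beak", "brown streaked plumage", "a small rounded body"],
    "eagle":   ["a hooked yellow beak", "broad powerful wings", "sharp curved talons"],
}

def encode_attribute_prompts(class_name, attributes):
    """Build one attribute-enriched prompt per attribute and encode it with CLIP's text encoder."""
    prompts = [f"a photo of a {class_name}, which has {attr}." for attr in attributes]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)            # (num_attributes, d)
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize for cosine similarity

text_attr_feats = {c: encode_attribute_prompts(c, a) for c, a in class_attributes.items()}
print({c: tuple(f.shape) for c, f in text_attr_feats.items()})
```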

Visual Attribute Prompting

In conjunction with textual attributes, the paper introduces Visual Attribute Prompting through learnable vectors inserted into the Vision Transformer layers. These visual prompts interact with image tokens, enabling the extraction and aggregation of fine-grained visual features. The Adaptive Visual Attribute Enhancement (AVAE) module further refines the visual prompts by aligning them with selected textual attributes through a cross-attention mechanism, allowing the model to adaptively recognize attributes of unseen categories. Figure 2

Figure 2: MAP architecture illustrating the integration of textual and visual prompts along with attribute-level alignment.
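To make the visual prompting and AVAE ideas concrete, here is a minimal PyTorch sketch of learnable visual attribute prompts refined by cross-attention over textual attribute features. The module structure, dimensions, and residual update are assumptions for illustration, not the authors' exact implementation.

```python
# Sketch: learnable visual attribute prompts enhanced by textual attributes via cross-attention.
import torch
import torch.nn as nn

class VisualAttributePrompting(nn.Module):
    def __init__(self, dim=512, num_prompts=4, num_heads=8):
        super().__init__()
        # Learnable visual attribute prompts inserted alongside the image tokens.
        self.visual_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Cross-attention: visual prompts (queries) attend to textual attribute features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens, text_attr_feats):
        # image_tokens:    (B, N, dim) patch tokens from the ViT
        # text_attr_feats: (B, A, dim) textual attribute embeddings for the candidate class
        B = image_tokens.size(0)
        prompts = self.visual_prompts.unsqueeze(0).expand(B, -1, -1)   # (B, P, dim)
        # Enhance the visual prompts with textual attribute semantics (AVAE-style refinement).
        enhanced, _ = self.cross_attn(query=prompts,
                                      key=text_attr_feats,
                                      value=text_attr_feats)
        prompts = prompts + enhanced                                   # residual update
        # Concatenate refined prompts with image tokens; a ViT block would process this sequence.
        return torch.cat([prompts, image_tokens], dim=1)               # (B, P+N, dim)

# Toy usage with random tensors standing in for real features.
module = VisualAttributePrompting()
out = module(torch.randn(2, 196, 512), torch.randn(2, 3, 512))
print(out.shape)  # torch.Size([2, 200, 512])
```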

Attribute-Level Alignment

To overcome the inherent limitations of global alignment, the authors reformulate the alignment task as an Optimal Transport problem, enabling precise attribute-level matching. By computing similarity scores between the visual and textual attribute distributions, MAP achieves robust cross-modal alignment that is less disrupted by cluttered scenes and irrelevant details.
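A minimal sketch of this alignment step is shown below: cosine similarities between visual and textual attribute features define the transport cost, a plain Sinkhorn solver produces the transport plan, and the OT-weighted similarity serves as the class score. The solver hyperparameters and uniform marginals are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: attribute-level alignment posed as entropic optimal transport (Sinkhorn iterations).
import torch

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropic OT between uniform marginals; cost: (M, N) -> transport plan (M, N)."""
    M, N = cost.shape
    mu = torch.full((M,), 1.0 / M, device=cost.device)
    nu = torch.full((N,), 1.0 / N, device=cost.device)
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.t() @ u + 1e-8)
        u = mu / (K @ v + 1e-8)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # diag(u) K diag(v)

def attribute_alignment_score(visual_attrs, text_attrs):
    """visual_attrs: (P, d), text_attrs: (A, d); returns a scalar alignment score for one class."""
    v = visual_attrs / visual_attrs.norm(dim=-1, keepdim=True)
    t = text_attrs / text_attrs.norm(dim=-1, keepdim=True)
    sim = v @ t.t()                             # cosine similarity, (P, A)
    plan = sinkhorn(1.0 - sim)                  # transport plan over attribute pairs
    return (plan * sim).sum()                   # OT-weighted similarity

score = attribute_alignment_score(torch.randn(4, 512), torch.randn(3, 512))
print(score.item())
```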

Experimental Analysis

The proposed MAP method demonstrates its efficacy through extensive evaluation across multiple settings, including base-to-novel class generalization, few-shot image classification, domain generalization, and cross-dataset evaluation.

Base-to-Novel Generalization

The results show that MAP transitions from base classes to novel, unseen classes more effectively than previous methods such as CoOp and CoCoOp, achieving higher harmonic mean accuracy across 11 diverse datasets. This underscores MAP's stronger capture of category semantics and improved adaptability. Figure 3

Figure 3: Performance comparison in base-to-novel generalization across 11 datasets.

Few-Shot Image Classification

MAP consistently outperforms other CLIP adaptation methods in few-shot scenarios, with particularly notable gains in the 1-shot setting, where its text-guided visual prompts deliver the largest improvements. Figure 4

Figure 4: Few-shot accuracy improvements attributed to AVAE module incorporation across multiple layers.

Domain and Cross-Dataset Generalization

Further experiments reveal MAP's robustness in domain-shifted environments and varied datasets, maintaining high accuracy levels compared to other state-of-the-art approaches. This cross-domain applicability highlights the precision of the attribute alignment strategies employed in MAP. Figure 5

Figure 5: Average performance over six few-shot classification datasets and visual prompt impact in domain generalization.

Conclusion

The Multi-modal Attribute Prompting method significantly advances the adaptation capabilities of Vision-Language Models in few-shot learning by introducing techniques for modeling and aligning fine-grained visual and textual attributes. This approach not only enhances the recognition of fine-grained details but also improves robustness against distracting content, suggesting promising directions for future work on model adaptability and real-world scalability.
