Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs (2403.11755v3)

Published 18 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Prompt ensembling of LLM-generated category-specific prompts has emerged as an effective method to enhance the zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.

Summary

The paper introduces MPVR, a framework that automates zero-shot classification by generating diverse, category-specific visual prompts via LLM meta-prompting.
It shows that MPVR prompts generated with GPT and Mixtral improve CLIP's zero-shot accuracy by up to 19.8% and 18.2%, respectively, on individual benchmarks (5.0% and 4.5% on average over 20 datasets).
The method minimizes human intervention and scales with the number of generated prompts; the authors also release a dataset of approximately 2.5 million class descriptions.

The paper introduces Meta-Prompting for Visual Recognition (MPVR), an automated framework designed to enhance zero-shot image recognition using LLMs and Vision-Language Models (VLMs). The core idea is to automate the generation of category-specific VLM prompts by meta-prompting LLMs, thus minimizing human intervention.

The approach involves a two-step process. First, the LLM is provided with a meta-prompt comprising a system prompt, an in-context example, and a short natural language description of the target task along with its class labels. This meta-prompt instructs the LLM to generate diverse task-specific LLM queries. Second, these generated queries are then used to obtain category-specific VLM prompts by querying the LLM again, this time specifying the class name. These category-specific VLM prompts are then ensembled into a zero-shot classifier.

The method leverages the knowledge of the visual world embedded within LLMs to produce a diverse set of prompts tailored to specific downstream tasks. The system prompt describes the meta-prompting task, and the in-context example contains a description of another task and its corresponding LLM queries. The in-context examples remain consistent across different downstream tasks. The LLM is then queried to produce LLM query templates containing a <class name> placeholder. These templates capture visual styles specific to the task but remain category-agnostic. Subsequently, for each class, the class label is inserted into the task-specific LLM query templates, and the LLM generates category-specific VLM prompts, describing the category in diverse visual ways and containing task-specific visual styles.
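The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the wording of `SYSTEM_PROMPT` and `IN_CONTEXT_EXAMPLE`, the function names, and the `toy_llm` stand-in are all hypothetical, and a real run would replace `toy_llm` with a call to GPT or Mixtral.

```python
from typing import Callable, Dict, List

# Hypothetical wording; the paper's actual system prompt and
# in-context example are not reproduced here.
SYSTEM_PROMPT = "You write LLM queries that elicit visual descriptions of categories."
IN_CONTEXT_EXAMPLE = (
    "Task: classify textures.\n"
    "Queries:\n"
    "What does a <class name> texture look like up close?"
)
PLACEHOLDER = "<class name>"


def build_meta_prompt(task_description: str, class_names: List[str]) -> str:
    """Combine system prompt, in-context example, and downstream task spec."""
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Example:\n{IN_CONTEXT_EXAMPLE}\n\n"
        f"Task: {task_description}\n"
        f"Classes: {', '.join(class_names)}\n"
        f"Queries (one per line, each containing {PLACEHOLDER}):"
    )


def generate_vlm_prompts(
    llm: Callable[[str], str],
    task_description: str,
    class_names: List[str],
) -> Dict[str, List[str]]:
    """Run both stages and return category-specific VLM prompts per class."""
    # Stage 1: ask for task-specific, category-agnostic query templates.
    raw = llm(build_meta_prompt(task_description, class_names))
    templates = [ln.strip() for ln in raw.splitlines() if PLACEHOLDER in ln]

    # Stage 2: fill in each class name and query the LLM again.
    prompts: Dict[str, List[str]] = {}
    for name in class_names:
        prompts[name] = []
        for template in templates:
            answer = llm(template.replace(PLACEHOLDER, name))
            prompts[name].extend(s.strip() for s in answer.splitlines() if s.strip())
    return prompts


def toy_llm(prompt: str) -> str:
    """Offline stand-in for a real LLM call (assumption for illustration)."""
    if prompt.startswith(SYSTEM_PROMPT):
        return (
            "Describe a satellite photo of <class name>.\n"
            "How does <class name> look from above?"
        )
    return f"Response to: {prompt}"


if __name__ == "__main__":
    result = generate_vlm_prompts(
        toy_llm, "satellite land-use classification", ["forest", "river"]
    )
    for cls, texts in result.items():
        print(cls, texts)
```

Keeping the LLM behind a plain callable means the same pipeline can be pointed at different models, which mirrors the paper's comparison of GPT and Mixtral without committing to any particular client library.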

The authors emphasize that their meta-prompting strategy doesn't require dataset-specific parameters, except for the dataset description, which can be easily obtained from public APIs or the dataset's webpage. The generated prompts are shown to cover diverse visual concepts and styles, leading to significant performance gains across various zero-shot benchmarks.

The contributions of the paper are threefold:

- It introduces MPVR, a general automated framework for zero-shot classification that minimizes human involvement by using meta-prompting to tap into the visual world knowledge of LLMs.
- It demonstrates the generalizability of MPVR beyond closed models like GPT, showing that open-source models like Mixtral can also enhance the zero-shot recognition abilities of VLMs.
- It releases a dataset of approximately 2.5 million unique class descriptions generated from GPT and Mixtral using the meta-prompting framework, representing a large-scale dataset encompassing the breadth of LLM knowledge of the visual world.

The paper evaluates MPVR on 20 object recognition datasets, including ImageNet, ImageNet-V2, CIFAR-10/100, Caltech-101, and others, and compares its performance against several baselines, including CLIP, CUPL, DCLIP, and Waffle. The results demonstrate that MPVR consistently outperforms the CLIP zero-shot baseline, with improvements of up to 19.8% and 18.2% on individual datasets when using GPT and Mixtral, respectively. Averaged across the 20 datasets, MPVR improves upon CLIP by 5.0% with GPT and 4.5% with Mixtral.

Ablation studies are conducted to assess the significance of different components of MPVR. The results show that all major components of the meta-prompt, including the system prompt, in-context example, and downstream task specification, have a strong effect on the downstream performance.

The paper also explores ensembling different text sources, such as GPT-generated VLM prompts, Mixtral-generated VLM prompts, and dataset-specific templates from CLIP. The results indicate that ensembling over the embedding space with both GPT and Mixtral prompts performs best. Additionally, the paper compares dual-encoder models like CLIP with multi-modal language models (MMLMs) for image classification, finding that CLIP outperforms LLaVA on object recognition tasks, which justifies the use of CLIP as the discriminative model in the study.

Finally, a scaling analysis demonstrates that increasing the number of generated VLM prompts significantly boosts performance, indicating promising scaling potential for MPVR.

In the experimental evaluation, the zero-shot likelihood of class $\hat{c}$ is defined as: $l_{\hat{c}}(x) = \frac{e^{\text{sim}(e_{\hat{c}}, e(x)) / \tau}}{\sum_{c \in C} e^{\text{sim}(e_c, e(x)) / \tau}}$, where $e(x)$ is the image embedding, $e_c$ is the text embedding for class $c$, and $\tau$ is the temperature constant.

$l_{\hat{c}}(x)$ : zero-shot likelihood of class $\hat{c}$ given image $x$
$e(x)$ : image embedding of image $x$
$C$ : set of candidate classes
$e_c$ : text embedding for class $c$
$\text{sim}$ : cosine similarity
$\tau$ : temperature constant
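
As a worked illustration of the likelihood above, the following sketch computes a temperature-scaled softmax over cosine similarities in plain Python. The function names and the default `tau=0.01` are assumptions chosen for illustration; the summary does not state the temperature value used in the paper.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity sim(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_likelihoods(
    image_emb: Sequence[float],
    class_embs: Dict[str, Sequence[float]],
    tau: float = 0.01,  # illustrative default, not taken from the paper
) -> Dict[str, float]:
    """Return l_c(x) for every class c in C as a softmax over sim / tau."""
    logits = {c: cosine(e_c, image_emb) / tau for c, e_c in class_embs.items()}
    # Subtracting the max logit leaves the softmax unchanged but avoids overflow.
    peak = max(logits.values())
    exps = {c: math.exp(v - peak) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}


if __name__ == "__main__":
    probs = zero_shot_likelihoods([1.0, 0.0], {"cat": [1.0, 0.0], "dog": [0.0, 1.0]})
    print(max(probs, key=probs.get), probs)
```

The predicted label is the class with the highest likelihood; because the softmax is monotonic, this is the same class that has the highest cosine similarity to the image embedding.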

The text embedding $e_c$ is computed as: $e_c = \frac{1}{|P|} \sum_{p \in P} e(p(c))$, where $P$ is the set of prompt templates and $p(c)$ is a prompt obtained by completing template $p$ with the label of class $c$.

$e_c$ : text embedding for class $c$
$|P|$ : the number of prompt templates in the set $P$
$p \in P$ : prompt template $p$ in the set of prompt templates $P$
$p(c)$ : a prompt obtained by completing template $p$ with the label of class $c$
$e(p(c))$ : embedding of the prompt $p(c)$
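
The averaging in the formula for $e_c$ can be sketched as follows. This is a minimal illustration, assuming templates that use a `{}` placeholder for the class label; `class_text_embedding` and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for a real text encoder such as CLIP's.

```python
from typing import Callable, List, Sequence


def class_text_embedding(
    class_name: str,
    templates: List[str],
    embed_text: Callable[[str], Sequence[float]],
) -> List[float]:
    """Compute e_c as the mean of e(p(c)) over all templates p in P."""
    # p(c): complete each template with the class label, then embed it.
    embeddings = [embed_text(t.format(class_name)) for t in templates]
    dim = len(embeddings[0])
    # Element-wise mean over the |P| prompt embeddings.
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in encoder: [character count, word count]."""
    return [float(len(text)), float(len(text.split()))]


if __name__ == "__main__":
    templates = ["a photo of a {}.", "a sketch of a {}."]
    print(class_text_embedding("cat", templates, toy_embed))
```

In practice, implementations often L2-normalize each prompt embedding before averaging and renormalize the mean afterwards; that step is omitted here because the formula above does not include it.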