Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition (2304.04704v2)

Published 10 Apr 2023 in cs.CV, cs.AI, and cs.CL

Abstract: This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.

Summary

The paper introduces POMP, a prompt pre-training framework that improves zero-shot open-vocabulary visual recognition while cutting memory usage from 300 GB to under 16 GB.
The paper uses a local contrast strategy that samples a limited subset of classes at each iteration, paired with a local correction step that compensates for the bias this sampling introduces.
The paper achieves state-of-the-art results on ImageNet-21K classification and transfers its strong performance to multiple downstream tasks without fine-tuning.

An Expert Overview of POMP for Open-Vocabulary Visual Recognition

The paper presents Prompt Pre-training with Many Classes (POMP), a method for vision-language models aimed at open-vocabulary visual recognition. Its primary objective is to improve zero-shot recognition by keeping computational and memory demands low enough to pre-train a prompt on the full ImageNet-21K dataset, which covers over 20,000 classes.

Key Contributions

Prompt Pre-training (POMP): The authors propose a prompt pre-training framework that exploits the scale of datasets like ImageNet-21K. By pre-training a universal soft prompt, POMP improves the ability of vision-language models to generalize to novel visual categories without task-specific fine-tuning. Training over this many classes is made feasible by a class sampling mechanism, which reduces memory requirements from 300 GB to less than 16 GB.
Local Contrast and Correction: POMP employs a class sampling strategy termed 'local contrast' to limit computational overhead. At each iteration it samples a subset of the total class set, so the contrastive loss is computed over only the sampled classes rather than all of them. A complementary 'local correction' strategy rectifies the bias introduced by this sampling, so that the prompt retains its generalization capability.
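
The sampling idea described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name, its parameters, the uniform sampling of negatives, and in particular the correction term log((N - 1) / (K - 1)) added to negative logits (a standard sampled-softmax style adjustment, with N total classes and K sampled classes) are assumptions made here for illustration, and the paper's exact formulation may differ.

```python
import math
import random


def local_contrast_loss(image_feat, class_feats, label, num_sampled,
                        temperature=0.07, seed=0, correct=True):
    """Contrastive loss over a sampled subset of classes (illustrative only).

    image_feat:  list of floats, one image embedding
    class_feats: list of class embeddings (one list of floats per class)
    label:       index of the ground-truth class
    num_sampled: K, number of classes kept per step (ground truth + K-1 negatives)
    """
    num_classes = len(class_feats)
    rng = random.Random(seed)

    # Local contrast: keep the ground-truth class and draw K-1 negatives
    # uniformly, without replacement, from the remaining classes.
    others = [c for c in range(num_classes) if c != label]
    sampled = [label] + rng.sample(others, num_sampled - 1)

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    logits = [cosine(image_feat, class_feats[c]) / temperature for c in sampled]

    if correct:
        # Assumed correction: raise negative logits so the K-1 sampled
        # negatives stand in for all N-1 negatives. The shift is zero
        # when every class is sampled (K == N).
        shift = math.log((num_classes - 1) / (num_sampled - 1))
        logits = [logits[0]] + [z + shift for z in logits[1:]]

    # Numerically stable cross-entropy with the ground truth at index 0.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[0]
```

Only K class embeddings enter the loss at each step, which is where the memory saving comes from. In this sketch, setting num_sampled equal to the number of classes recovers the ordinary full softmax loss, and the correction then has no effect.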

Empirical Results

Empirical performance evaluations reveal that POMP outpaces existing state-of-the-art (SOTA) models across diverse visual recognition tasks:

Image Classification: On the ImageNet-21K test set, POMP attains a leading accuracy of 25.3%. Transferring this prompt to ten downstream image datasets yields the highest average accuracy, 67.0%, which is 3.1 points above CoOp and indicates that the prompt generalizes across domains.
Semantic Segmentation and Object Detection: On the COCO Stuff and Pascal VOC semantic segmentation tasks, POMP reaches harmonic IoU (hIoU) scores of 39.1 and 84.4, respectively; the Pascal VOC score is 6.9 points above ZSSeg. POMP also improves AP scores on object detection tasks, reflecting its ability to recognize diverse and unseen object categories.
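
For readers unfamiliar with the metric, hIoU in zero-shot segmentation is usually defined as the harmonic mean of the mean IoU on seen classes and on unseen classes. The formula below assumes that conventional definition applies here; the subscript names are chosen for illustration.

```latex
\[
\mathrm{hIoU} \;=\;
\frac{2 \cdot \mathrm{mIoU}_{\text{seen}} \cdot \mathrm{mIoU}_{\text{unseen}}}
     {\mathrm{mIoU}_{\text{seen}} + \mathrm{mIoU}_{\text{unseen}}}
\]
```

Because it is a harmonic mean, hIoU is high only when both seen-class and unseen-class mIoU are high, so a model cannot score well by neglecting unseen categories.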

Practical and Theoretical Implications

The proposed POMP method has two main implications for visual recognition systems:

Scalability and Efficiency: POMP's design mitigates the prohibitive computational requirements traditionally associated with large-scale datasets and class sets, rendering it practical for deployment in diverse real-world applications where zero-shot capabilities are pivotal.
Generalization Across Tasks: The adaptability of the pre-trained prompt to various vision tasks without specialized fine-tuning underscores a significant step towards creating more versatile and robust AI systems, facilitating broader adoption across dynamic environments.

Speculations on Future Developments

Future work might explore the following directions:

Theoretical Robustness Analysis: A rigorous theoretical analysis of the error introduced when the contrastive loss is estimated from a sampled subset of classes, rather than computed over the full class set, would strengthen the foundations of the POMP framework.
Semantic Utilization via Hierarchies: Leveraging the semantic structure within datasets such as ImageNet-21K, augmented with techniques that utilize hyponym and hypernym relationships, could potentially refine the representational quality and effectiveness of the soft prompt.
Interpretability of Soft Prompts: Addressing the challenges surrounding the interpretability of the continuous optimized vectors in soft prompts could pave the way for more transparent AI systems and foster trust in AI-driven decision-making.

In conclusion, POMP advances the capabilities of vision-language models, laying a foundation for future research in open-vocabulary recognition and contributing to work on more scalable, efficient, and generalizable AI systems.

GitHub

GitHub - amazon-science/prompt-pretraining: Official implementation for the paper "Prompt Pre-Training with Over Twenty-Thousand Classes for Open-Vocabulary Visual Recognition" (258 stars)