Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions (2401.02460v2)
Abstract: The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale, aligned image and text datasets in specific domains. In this work, we leverage two complementary sources of information -- descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets -- to improve the zero-shot classification performance of VLMs across fine-grained domains. On the technical side, we develop methods to train VLMs with this "bag-level" image-text supervision. We find that simply using these attributes at test time does not improve performance; our training strategy, however, leads to an average improvement of 4-5% in zero-shot classification accuracy for novel categories of birds and flowers on the iNaturalist dataset, for example. Similar improvements are observed in domains where a subset of the categories was used to fine-tune the model. By prompting LLMs in various ways, we generate descriptions that capture visual appearance, habitat, and geographic regions and pair them with existing attributes such as the taxonomic structure of the categories. We systematically evaluate their ability to improve zero-shot categorization in natural domains. Our findings suggest that geographic priors can be just as effective as, and are complementary to, visual appearance. Our method also outperforms prior work on prompt-based tuning of VLMs. We release the benchmark, consisting of 14 datasets, at https://github.com/cvl-umass/AdaptCLIPZS, which will contribute to future research in zero-shot recognition.
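For concreteness, the sketch below shows how test-time zero-shot classification with LLM-generated category descriptions can be set up using OpenAI's open-source `clip` package: each class is represented by the mean of the normalized text embeddings of its descriptions, and an image is assigned to the most similar class. The class names, description strings, and image path are illustrative placeholders, not the paper's released prompts, and this is only the inference step; the paper's contribution is fine-tuning the VLM on such "bag-level" (category, description) pairs, which this sketch does not cover.

```python
# Minimal sketch of zero-shot classification with per-class LLM descriptions.
# Assumes the OpenAI `clip` package (github.com/openai/CLIP) is installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated descriptions per category, covering appearance,
# habitat, and geography; stand-ins for the descriptions the paper generates.
class_descriptions = {
    "painted bunting": [
        "a photo of a painted bunting, a bird with a blue head and red underparts.",
        "a photo of a painted bunting, found in thickets of the southern United States.",
    ],
    "indigo bunting": [
        "a photo of an indigo bunting, a small, uniformly deep-blue songbird.",
        "a photo of an indigo bunting, common in brushy forest edges of eastern North America.",
    ],
}

class_names = list(class_descriptions)
with torch.no_grad():
    # Represent each class as the mean of its normalized description embeddings.
    class_embs = []
    for name in class_names:
        tokens = clip.tokenize(class_descriptions[name]).to(device)
        embs = model.encode_text(tokens)
        embs = embs / embs.norm(dim=-1, keepdim=True)
        class_embs.append(embs.mean(dim=0))
    class_embs = torch.stack(class_embs)
    class_embs = class_embs / class_embs.norm(dim=-1, keepdim=True)

    # Embed a query image (placeholder path) and score it against each class.
    image = preprocess(Image.open("bird.jpg").convert("RGB")).unsqueeze(0).to(device)
    image_emb = model.encode_image(image)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    scores = (image_emb @ class_embs.T).squeeze(0)  # cosine similarities
    print(class_names[scores.argmax().item()])
```

Averaging several description embeddings yields one classifier weight per category, the standard prompt-ensembling recipe for CLIP; the abstract's observation is that this alone does not help at test time, which is what motivates training the VLM on the descriptions instead.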