Emergent Mind

Abstract

Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated by a Large Language Model (LLM) describing the categories of interest, effectively substituting for labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating absolute improvements of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision.

Overview

  • The paper introduces LaFTer, an innovative method that enhances zero-shot visual classifiers without needing labeled data.

  • LaFTer uses LLM-generated class descriptions to train a text classifier that, through CLIP's shared text-image embedding space, can also classify images encoded by the visual encoder.

  • The method includes an unsupervised fine-tuning step using pseudo-labeling to refine the visual encoder's image recognition capabilities.

  • LaFTer has been validated through rigorous benchmarks, outperforming state-of-the-art label-free methods and challenging few-shot learning techniques.

  • This new approach has the potential to make visual classification more scalable and cost-effective across a wide range of applications.

Introduction

Recent advances in Vision and Language (VL) models, such as CLIP, enable zero-shot visual classification: categories are described with natural-language prompts, and an image is assigned to whichever prompt it matches best in the shared embedding space. Despite this flexibility, zero-shot VL classifiers still lag behind dedicated supervised classifiers, and closing that gap has traditionally required additional training on labeled data from the target categories. The study summarized here explores a way to improve the zero-shot classifier without any labels, sidestepping the costly process of manual annotation.
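As a concrete illustration of the zero-shot setup LaFTer starts from, the minimal sketch below classifies a single image with CLIP by comparing its embedding to embeddings of handcrafted class prompts. The `clip` package usage is standard, but the class names, prompt template, and image path are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of CLIP zero-shot classification (class names, prompt
# template, and image path are illustrative placeholders).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "car"]                     # illustrative label set
prompts = [f"a photo of a {c}" for c in class_names]    # handcrafted template

with torch.no_grad():
    # Embed every class prompt with the text encoder and L2-normalize.
    text_feat = model.encode_text(clip.tokenize(prompts).to(device)).float()
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Embed the query image with the visual encoder and L2-normalize.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    img_feat = model.encode_image(image).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    # Cosine similarity between the image and each prompt acts as the logits.
    logits = 100.0 * img_feat @ text_feat.T
    prediction = class_names[logits.argmax(dim=-1).item()]

print(prediction)
```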

Training Without Labels

The proposed procedure, called LaFTer, delivers significant performance gains over the base zero-shot classifier without any label information or explicit image-text pairs. Instead, it takes a label-free approach and trains a classifier on text data alone. The text dataset is crafted by prompting a Large Language Model (LLM) with the class names and combining the generated descriptions with handcrafted prompts. Although trained purely on text, the resulting classifier can also classify images when paired with the CLIP visual encoder, because CLIP maps text and images into a shared embedding space; a sketch of this step follows.
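The following is a minimal sketch of this text-only training idea, not the paper's exact recipe: it assumes the LLM-generated and handcrafted descriptions have already been collected into a `descriptions_per_class` dictionary, and all hyperparameters are illustrative.

```python
# Sketch: train a linear head on CLIP text embeddings of class descriptions,
# then reuse the same head on CLIP image embeddings (shared embedding space).
import clip
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

descriptions_per_class = {   # illustrative stand-in for LLM output + templates
    "dog": ["a photo of a dog", "a loyal four-legged pet covered in fur"],
    "cat": ["a photo of a cat", "a small feline with whiskers and pointed ears"],
}
class_names = list(descriptions_per_class)

# Flatten descriptions and record which class each one describes.
texts, labels = [], []
for idx, name in enumerate(class_names):
    for d in descriptions_per_class[name]:
        texts.append(d)
        labels.append(idx)

# Embed every description once with the frozen text encoder.
with torch.no_grad():
    feats = model.encode_text(clip.tokenize(texts, truncate=True).to(device)).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
labels = torch.tensor(labels, device=device)

# Train a small linear classifier over the shared embedding space.
head = nn.Linear(feats.shape[1], len(class_names)).to(device)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(100):                                  # illustrative epoch count
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(feats), labels)
    loss.backward()
    opt.step()

# At test time, the same head scores *image* embeddings from the visual encoder.
```

Because the classifier only ever sees normalized embeddings, it does not need to know whether they came from the text or the visual encoder, which is what makes the text-to-image transfer possible.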

Unsupervised Fine-Tuning

The researchers did not stop at training a classifier on text alone; they added a novel unsupervised fine-tuning step. This stage employs pseudo-labeling, inspired by the FixMatch technique: the text-trained classifier assigns tentative labels to a collection of unlabeled images, and these pseudo-labels spur an iterative self-training process that improves the model's ability to distinguish between image classes without supervision. The whole procedure remains highly parameter-efficient, tuning only a small number of parameters, which helps prevent overfitting and keeps the approach practical; a sketch of such a loop is shown below.
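The sketch below shows a generic FixMatch-style pseudo-labeling loop of the kind described, not the paper's exact implementation. It reuses `model`, `head`, and `device` from the previous sketch, assumes an `unlabeled_loader` that yields lists of PIL images, uses simplified augmentations (CLIP's normalization is omitted for brevity), and for simplicity tunes only the small classification head rather than the paper's parameter-efficient components on the visual side.

```python
# Hedged sketch of FixMatch-style self-training with the text-trained head.
import torch
import torch.nn.functional as F
from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandAugment(),
    transforms.ToTensor(),
])

threshold = 0.95                                     # confidence cut-off, assumed
opt = torch.optim.Adam(head.parameters(), lr=1e-4)   # only the small head is tuned

def encode(imgs):
    """Embed a batch of images with the frozen CLIP visual encoder."""
    feats = model.encode_image(imgs.to(device)).float()
    return feats / feats.norm(dim=-1, keepdim=True)

for pil_batch in unlabeled_loader:                   # assumed: lists of PIL images
    weak = torch.stack([weak_aug(im) for im in pil_batch])
    strong = torch.stack([strong_aug(im) for im in pil_batch])

    # Pseudo-labels come from the weakly augmented view, without gradients.
    with torch.no_grad():
        probs = F.softmax(head(encode(weak)), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold                     # keep only confident predictions

    # The strongly augmented view is trained to match the confident pseudo-labels.
    if keep.any():
        logits = head(encode(strong))
        loss = F.cross_entropy(logits[keep], pseudo[keep])
        opt.zero_grad()
        loss.backward()
        opt.step()
```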

Conclusive Analysis

The introduction of LaFTer marks a new paradigm in adapting VL models to target tasks without the conventional reliance on labeled datasets. Through rigorous evaluations across various benchmarks, LaFTer has been shown to outperform state-of-the-art methods under the same label-free conditions, and even to edge out few-shot methods that use a small amount of labeled data. This framework could change how VL models are adapted, making the process more cost-effective and scalable without compromising performance. The broader implications suggest potential advances in numerous visual classification applications, from enhancing legacy systems in security to streamlining quality-control processes.

With LaFTer, adapting VL models becomes far less constrained, freeing them from the typical bottlenecks of data labeling. This advance paves the way for broader applications where the cost and logistical challenges of obtaining labeled data have been prohibitive, and it may inspire further research and innovation in artificial intelligence and machine learning by making high-performing visual classifiers more accessible and easier to deploy across domains.
