- The paper presents MUST, a label-free self-training technique that integrates pseudo-labels and masked image modeling for robust feature extraction.
- It jointly optimizes three objectives, aligning global class-level predictions with local pixel-level details, and reaches 77.7% top-1 accuracy on ImageNet.
- MUST reduces annotation costs while improving model performance, offering a scalable approach to label-free image classification.
Overview of "Masked Unsupervised Self-training for Label-free Image Classification"
The paper "Masked Unsupervised Self-training for Label-free Image Classification" introduces a novel approach to enhancing zero-shot image classifiers using unlabeled data. The key contribution of the paper is the proposed methodology, Masked Unsupervised Self-Training (MUST), which leverages unlabeled images to improve performance over existing models like CLIP through unsupervised finetuning. MUST mitigates the reliance on costly annotations that limit the scalability of supervised learning models by bridging self-supervised and self-training paradigms.
Key Contributions
The paper identifies the following pivotal components of the MUST methodology:
- Integration of Pseudo-Labels and Raw Images: MUST optimizes three concurrent objectives, using pseudo-labels for global classification and masked image modeling for local feature learning. It enforces alignment between global class-level predictions and local pixel-level details without requiring additional labeled data.
- Three-fold Objective Optimization: Unsupervised finetuning jointly targets three objectives (a combined sketch follows this list):
  - A self-training objective that generates pseudo-labels via an exponential-moving-average (EMA) teacher model.
  - A masked image modeling objective that recovers local detail from masked image patches.
  - A global-local feature alignment objective that ties class-level predictions to local patch features.
- Effectiveness Compared to Supervised Methods: The paper presents empirical evidence showing that MUST surpasses supervised few-shot methods. For instance, it achieves a top-1 accuracy of 77.7% on ImageNet using ViT-B, a substantial improvement over CLIP.
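The three objectives can be summarized in one training step. The sketch below is a reading of the recipe, not the authors' implementation: the module interfaces (`classify`, `forward_with_tokens`, `pixel_decoder`, `patch_classifier`), the confidence threshold, and the equal loss weighting are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher parameters track an exponential moving average of the student's.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def must_step(student, teacher, weak_view, strong_view, masked_view,
              mask, pixel_targets, threshold=0.7):
    # (1) Self-training: the EMA teacher pseudo-labels a weakly augmented view;
    #     only confident labels supervise the student on a strongly augmented view.
    with torch.no_grad():
        probs = F.softmax(teacher.classify(weak_view), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold

    logits, patch_tokens = student.forward_with_tokens(strong_view)
    loss_st = (F.cross_entropy(logits[keep], pseudo[keep])
               if keep.any() else logits.new_zeros(()))

    # (2) Masked image modeling: reconstruct the masked patches of a masked view,
    #     pushing the encoder toward local, pixel-level detail.
    _, masked_tokens = student.forward_with_tokens(masked_view)
    pred = student.pixel_decoder(masked_tokens[mask])   # (num_masked, patch_dim)
    loss_mim = F.mse_loss(pred, pixel_targets[mask])

    # (3) Global-local alignment: classify individual patch tokens and pull
    #     their predictions toward the image-level pseudo-label.
    patch_logits = student.patch_classifier(patch_tokens)  # (B, N, C)
    if keep.any():
        B, N, C = patch_logits.shape
        loss_align = F.cross_entropy(patch_logits[keep].reshape(-1, C),
                                     pseudo[keep].repeat_interleave(N))
    else:
        loss_align = logits.new_zeros(())

    return loss_st + loss_mim + loss_align
```

After each optimizer step on the combined loss, calling `ema_update(teacher, student)` keeps the teacher slowly trailing the student, which stabilizes the pseudo-labels.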
Numerical Results
MUST delivers consistent gains across a range of datasets, indicating its robustness and adaptability:
- On ImageNet, it improves top-1 accuracy by +9.4% over zero-shot CLIP and by +6.2% over 16-shot CLIP adaptation.
- These gains hold across domains, including food, scenes, and textures.
Implications and Future Directions
The implications of MUST are significant: it offers a label-free adaptation framework that enables efficient domain-specific training. Practically, this reduces data-annotation costs; methodologically, it blurs the conventional separation between self-supervised learning and self-training.
Future work could extend MUST to other modalities such as NLP, where masked language modeling might similarly bridge self-supervised tasks with self-training. MUST's modular design also invites integration with other training paradigms and refinement of its constituent components, such as pseudo-label selection and image masking strategies; a minimal masking sketch follows.
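As one concrete example of a tunable component, a random patch-masking function for ViT-style inputs might look like the following; the mask ratio and function interface are illustrative, not the paper's exact strategy:

```python
import torch

def random_patch_mask(batch_size, num_patches, mask_ratio=0.5, device="cpu"):
    """Boolean mask of shape (batch_size, num_patches); True marks a masked patch."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch_size, num_patches, device=device)
    ids = noise.argsort(dim=1)[:, :num_masked]  # random patch indices per image
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool, device=device)
    return mask.scatter_(1, ids, True)
```

Variants such as block-wise masking or a different mask ratio are exactly the kind of knobs such exploration would tune.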
Overall, the paper represents an advancement in unsupervised learning methodologies, specifically within the domain of label-free image classification, and encourages further exploration into efficient model adaptation without reliance on extensive labeled datasets.