Extract Free Dense Labels from CLIP

Published 2 Dec 2021 in cs.CV and cs.CL | (2112.01071v2)

Abstract: Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. To this end, with minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our findings suggest that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation. Source code is available at https://github.com/chongzhou96/MaskCLIP.



Citations (363)


Summary

The paper introduces MaskCLIP, harnessing CLIP's pre-trained dense features to perform pixel-level semantic segmentation without the need for fine-tuning.
MaskCLIP+ builds on this by using MaskCLIP's predictions as pseudo labels for self-training, raising unseen-class mIoU on PASCAL VOC from 35.6 to 86.1.
The study highlights an annotation-free approach to segmentation, paving the way for further integration of language-image models in dense prediction tasks.


The paper "Extract Free Dense Labels from CLIP," authored by Chong Zhou, Chen Change Loy, and Bo Dai, examines the use of Contrastive Language-Image Pre-training (CLIP) for pixel-level dense prediction tasks, specifically semantic segmentation. The authors propose a novel approach called MaskCLIP, which utilizes CLIP's pre-trained models to achieve competitive segmentation results without the need for fine-tuning or annotations.

Overview

CLIP, developed by OpenAI, has significantly advanced open-vocabulary zero-shot image recognition through large-scale visual-text pre-training. CLIP is typically used for image-level tasks; the authors instead explore its potential for pixel-level semantic segmentation. The study demonstrates that MaskCLIP can perform segmentation by extracting dense patch-level features from the CLIP image encoder.
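The idea of labeling each patch from dense features can be illustrated with a small sketch. It assumes that every patch feature is compared with one text embedding per class name by cosine similarity and assigned the best-matching class. The function names, the toy 3-dimensional vectors, and the two-class setup are hypothetical stand-ins, not the authors' code or real CLIP outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def classify_patches(patch_features, text_embeddings):
    """Assign each patch the index of its most similar class embedding.

    patch_features: H x W grid of feature vectors (stand-ins for dense
        features taken from an image encoder).
    text_embeddings: one vector per class name (stand-ins for text-encoder
        outputs).
    Returns an H x W grid of class indices.
    """
    classes = [l2_normalize(t) for t in text_embeddings]
    label_map = []
    for row in patch_features:
        label_row = []
        for feat in row:
            f = l2_normalize(feat)
            # Cosine similarity is the dot product of unit-length vectors.
            sims = [sum(a * b for a, b in zip(f, c)) for c in classes]
            label_row.append(max(range(len(sims)), key=sims.__getitem__))
        label_map.append(label_row)
    return label_map

# Toy 2x2 grid of 3-d features and two hypothetical class embeddings.
toy_patches = [
    [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    [[0.1, 0.9, 0.2], [0.0, 1.0, 0.0]],
]
toy_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_labels = classify_patches(toy_patches, toy_text)
print(toy_labels)
```

Because the class set is defined only by the list of text embeddings, changing the vocabulary in this sketch means changing that list, with no retraining step.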

Methodology

MaskCLIP: This approach makes minimal modifications to CLIP's architecture. It uses the features from the last attention layer as dense features and keeps the visual-language association in CLIP's original feature space, so the model can segment diverse concepts without fine-tuning. Two refinements, key smoothing and prompt denoising, further improve prediction accuracy without additional training.
MaskCLIP+: Because MaskCLIP is tied to the architecture of CLIP's image encoder, MaskCLIP+ instead uses MaskCLIP's predictions as pseudo labels and trains a separate segmentation network on them (self-training). This lets MaskCLIP+ use more advanced segmentation architectures such as DeepLab and PSPNet.
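The pseudo-labeling step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: prompt denoising is assumed to mean dropping any class whose softmax probability stays below a threshold at every location, and the per-pixel confidence filter is an extra assumed step that is off by default. The names `make_pseudo_labels` and `IGNORE`, the threshold values, and the toy scores are all hypothetical.

```python
import math

IGNORE = 255  # hypothetical "ignore" index for pixels left unlabeled

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_pseudo_labels(score_map, class_threshold=0.5, pixel_threshold=0.0):
    """Turn per-pixel class scores into a pseudo-label map.

    score_map: H x W grid, each cell a list of per-class similarity scores.
    class_threshold: a class is dropped when its probability stays below
        this value at every location (assumed form of prompt denoising).
    pixel_threshold: pixels whose best probability is below this value are
        marked IGNORE (an assumed filtering step, disabled at 0.0).
    Returns (label grid, list of kept class indices).
    """
    probs = [[softmax(cell) for cell in row] for row in score_map]
    num_classes = len(probs[0][0])
    # Keep only classes that look confident somewhere in the image.
    kept = [
        c for c in range(num_classes)
        if any(cell[c] >= class_threshold for row in probs for cell in row)
    ]
    labels = []
    for row in probs:
        out = []
        for cell in row:
            if not kept:
                out.append(IGNORE)
                continue
            best = max(kept, key=lambda c: cell[c])
            out.append(best if cell[best] >= pixel_threshold else IGNORE)
        labels.append(out)
    return labels, kept

# Toy 1x3 score map with three classes; class 2 is never confident.
toy_scores = [[[4.0, 0.0, 3.5], [0.0, 4.0, 0.0], [0.1, 0.0, 0.3]]]
pseudo, kept_classes = make_pseudo_labels(toy_scores)
print(pseudo, kept_classes)
```

In the toy input, the third pixel would be labeled as class 2 by a plain argmax, but that class is removed because it never reaches the class threshold anywhere. A segmentation network could then be trained on the resulting label map, skipping pixels marked `IGNORE`.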

Experimental Results

Empirical results highlight the effectiveness of MaskCLIP and MaskCLIP+:

On datasets such as PASCAL VOC, PASCAL Context, and COCO Stuff, MaskCLIP+ surpasses state-of-the-art transductive zero-shot semantic segmentation methods with significant mIoU improvements for unseen classes.
Unseen-class mIoU rises from 35.6 to 86.1 on PASCAL VOC, from 20.7 to 66.7 on PASCAL Context, and from 30.3 to 54.7 on COCO Stuff.
The robustness assessment indicates that MaskCLIP maintains performance under various input corruptions, showcasing its potential in real-world applications.
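The mIoU figures quoted above are means of per-class intersection-over-union. A minimal sketch of the metric on a single label map is given below, assuming integer label grids and an ignore index of 255. The function name and toy grids are hypothetical, and benchmark evaluations typically accumulate intersections and unions over a whole dataset before averaging rather than scoring one image.

```python
def mean_iou(pred, target, class_ids, ignore_index=255):
    """Mean intersection-over-union over the given class ids.

    pred, target: H x W grids of integer labels.
    class_ids: classes to average over (for example, only unseen classes).
    Pixels whose target equals ignore_index are skipped. Classes absent
    from both prediction and target are left out of the mean.
    """
    inter = {c: 0 for c in class_ids}
    union = {c: 0 for c in class_ids}
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            for c in class_ids:
                in_pred, in_target = p == c, t == c
                if in_pred and in_target:
                    inter[c] += 1
                if in_pred or in_target:
                    union[c] += 1
    ious = [inter[c] / union[c] for c in class_ids if union[c] > 0]
    return sum(ious) / len(ious) if ious else float("nan")

# Toy example: one pixel of class 1 is mispredicted as class 0.
toy_pred = [[0, 0], [1, 1]]
toy_target = [[0, 1], [1, 1]]
toy_miou = mean_iou(toy_pred, toy_target, class_ids=[0, 1])
print(round(toy_miou, 4))
```

Passing only a subset of class ids, such as the unseen classes, restricts the average to those classes, which is how a separate unseen-class mIoU can be reported.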

Implications and Future Directions

MaskCLIP underscores the potential of CLIP features for dense prediction, advocating for a shift away from traditional fine-tuning approaches that disrupt pre-trained feature spaces. By retaining CLIP's visual-language associations, the proposed methods enhance segmentation tasks, even in open-vocabulary and annotation-free contexts.

The results point to practical applications, offering a reliable, annotation-free segmentation methodology that can be extended to other computer vision tasks. The study also paves the way for further research into leveraging language-image models in dense prediction and other vision tasks, encouraging more refined adaptations of pre-trained networks for diverse applications.

Moreover, this work suggests future exploration into refining pseudo-labeling techniques and integrating more sophisticated models to capitalize on the inherent knowledge embedded within large-scale language-image pre-training frameworks.

Conclusion

By successfully adapting CLIP's features for dense prediction, the authors open new avenues for semantic segmentation, challenging conventional approaches reliant on extensive annotations. The MaskCLIP and MaskCLIP+ models offer promising paths forward in semantic understanding within the field of computer vision, providing a robust foundation for ongoing innovation and exploration.
