Unsupervised Learning of Dense Visual Representations

Published 11 Nov 2020 in cs.CV | (2011.05499v2)

Abstract: Contrastive self-supervised learning has emerged as a promising approach to unsupervised visual representation learning. In general, these methods learn global (image-level) representations that are invariant to different views (i.e., compositions of data augmentation) of the same image. However, many visual understanding tasks require dense (pixel-level) representations. In this paper, we propose View-Agnostic Dense Representation (VADeR) for unsupervised learning of dense representations. VADeR learns pixelwise representations by forcing local features to remain constant over different viewing conditions. Specifically, this is achieved through pixel-level contrastive learning: matching features (that is, features that describes the same location of the scene on different views) should be close in an embedding space, while non-matching features should be apart. VADeR provides a natural representation for dense prediction tasks and transfers well to downstream tasks. Our method outperforms ImageNet supervised pretraining (and strong unsupervised baselines) in multiple dense prediction tasks.

Abstract PDF Upgrade to Chat

Authors (5)

Citations (183)

View on Semantic Scholar

Summary

The paper introduces VADeR, a novel unsupervised method that learns dense pixel-level representations using contrastive loss in an encoder-decoder framework.
It leverages multi-scale features and dynamic negative sampling to achieve significant improvements in semantic segmentation (mIoU) and depth prediction (RMSE).
The findings advocate aligning unsupervised learning with pixel-level tasks, paving the way for advancements in areas like medical imaging and autonomous navigation.

Unsupervised Learning of Dense Visual Representations: VADeR

The paper presents a novel approach to unsupervised visual representation learning, targeting dense (pixel-level) representations essential for various visual understanding tasks. The framework introduced, termed View-Agnostic Dense Representation (VADeR), is designed to improve upon conventional methods, which predominantly focus on global representations through contrastive learning. This essay provides an expert analysis of VADeR, detailing the methodology, results, and implications for future research.

Background

Historically, computer vision advancements have been reliant on supervised learning, tuned on large-scale labeled datasets like ImageNet. Recently, self-supervised and unsupervised learning methods, particularly those founded on contrastive learning principles, have gained traction as a means to exploit the abundance of unlabeled data. While these techniques have been successful in acquiring global image representations, they falter in dense prediction tasks that require detailed pixel-level understanding, such as image segmentation and depth prediction.

VADeR: Methodology

VADeR differentiates itself by focusing on pixelwise representations, utilizing an encoder-decoder architecture to compute similarity scores at the pixel level rather than image-level pooled representations. The paper describes this process as leveraging perceptual constancy, ensuring that local pixel-level features remain invariant across different views of a scene. VADeR employs pixel-level contrastive learning, based on known pixel correspondences derived from data augmentation processes, to enforce similarity between matching features and dissimilarity between non-matching ones.

The architecture integrates feature pyramid networks (FPN) to produce multi-scale features, which are then aggregated to derive dense representations conducive for structured tasks. An important component of VADeR's training is its contrastive loss formulation—adapted to pixel-level features—with negative sampling efficiently managed via a dynamic dictionary as part of a momentum-based moving average framework.

Numerical Results

The experimental results underscore VADeR's superiority over established unsupervised baselines like MoCo and the ImageNet-supervised pretraining approach across dense prediction tasks. Significant improvements were noted in semantic segmentation (measured in mIoU) and depth prediction (RMSE), with VADeR's unsupervised pretraining even surpassing some supervised benchmarks. Moreover, VADeR demonstrated competitive performance in video instance segmentation, indicating robust feature transferring capacity.

Implications and Future Research

VADeR's advancements in unsupervised dense representations exemplify a shift towards models better aligned with pixel-level tasks. The implications are manifold, enabling improved performance in scenarios with limited labeled data, and unlocking practical applications in domains demanding fine-grained visual understanding, such as medical imaging and autonomous navigation.

The findings advocate for a broader evaluation framework in self-supervised learning research, emphasizing the need to align representation learning targets with downstream tasks. Future developments should explore further optimization of dense feature extraction processes and explore hybrid frameworks that integrate both global and dense teachings to accommodate diverse applications.

VADeR establishes a foundation for subsequent explorations in dense visual representations without reliance on annotated datasets, propelling advancements in AI applications that mirror human-like perceptual constancy and understanding.

Markdown Report Issue