Learning Correspondence from the Cycle-Consistency of Time (1903.07593v2)

Published 18 Mar 2019 in cs.CV, cs.AI, and cs.LG

Abstract: We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as free supervisory signal for learning visual representations from scratch. At training time, our model learns a feature map representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation -- without finetuning -- across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods.

Citations (472)

Summary

  • The paper introduces a self-supervised method using cycle-consistency to learn visual correspondences without needing annotated data.
  • The framework employs a modified ResNet-50 and differentiable tracking components to robustly capture visual similarities across video frames.
  • Experimental results show strong generalization across tasks, rivaling supervised methods in segmentation, keypoint tracking, and optical flow estimation.

Learning Correspondence from the Cycle-Consistency of Time

The paper, "Learning Correspondence from the Cycle-consistency of Time," presents a self-supervised method to learn visual correspondence using cycle-consistency as a supervisory signal. Driven by the fundamental importance of correspondence in computer vision, the authors developed a framework that avoids reliance on labeled data and instead leverages the inherent structure of video sequences.

Key Contributions

  • Self-Supervised Learning Approach: The authors introduce a novel approach that uses cycle-consistency in time to learn visual representations. The model tracks a visual patch backward and then forward in time, and the inconsistency between the start and end points serves as the loss that supervises learning (see the sketch after this list).
  • Generalization Across Tasks: The learned feature representation is tested across various correspondence tasks without fine-tuning, including video object segmentation, keypoint tracking, and optical flow estimation. The approach is shown to outperform prior self-supervised techniques and compete with some supervised methods.
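
To make the objective concrete, here is a minimal sketch of the cycle loss. The `tracker` function, argument names, and the squared-distance penalty are our own illustrative assumptions, not the authors' exact formulation (the paper's full objective includes additional terms):

```python
def cycle_loss(tracker, patch, start_pos, frames):
    """Track a patch backward and then forward through time; the loss is
    the distance between where the patch started and where it ends up.

    tracker(patch, frame_feat) -> (new_patch, new_pos) is an assumed
    differentiable tracking step. `frames` holds encoder feature maps
    [f_{t-k}, ..., f_{t-1}, f_t] (at least two), and `patch` was taken
    from f_t at location `start_pos`.
    """
    cur = patch
    # Track backward in time: f_{t-1}, f_{t-2}, ..., f_{t-k}.
    for feat in reversed(frames[:-1]):
        cur, _ = tracker(cur, feat)
    # Track forward again: f_{t-k+1}, ..., f_t.
    for feat in frames[1:]:
        cur, pos = tracker(cur, feat)
    # The start/end inconsistency is the free supervisory signal.
    return ((pos - start_pos) ** 2).sum()
```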

Technical Implementation

The proposed method builds a differentiable tracking function from three main components: an affinity function, a localizer, and a bilinear sampler. This composition lets the network localize a patch across a sequence of video frames using the learned feature space. A sketch of the affinity component follows.
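
A common way to realize the affinity component is a temperature-scaled softmax over dense dot products between patch and frame features; the sketch below follows that recipe, with the function name, shapes, and temperature value as our assumptions rather than the paper's exact details:

```python
import torch.nn.functional as F

def affinity(patch_feat, frame_feat, temperature=0.07):
    """Soft correspondences between a patch and a frame.

    patch_feat: (C, h, w) features of the tracked patch.
    frame_feat: (C, H, W) features of the target frame.
    Returns an (h*w, H*W) row-stochastic affinity matrix.
    """
    p = patch_feat.flatten(1)        # (C, h*w)
    f = frame_feat.flatten(1)        # (C, H*W)
    sim = p.t() @ f                  # dot products, (h*w, H*W)
    return F.softmax(sim / temperature, dim=1)
```

The localizer would then regress a position from this affinity map, and the bilinear sampler would crop the new patch features at that position, keeping the whole pipeline differentiable.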

  • Feature Encoder: A modified ResNet-50 maps video frames into a feature space, trained end to end so that the features capture visual similarity across frames.
  • Cycle-Consistency Loss: The framework combines multiple cycle-consistency losses, including a long tracking-cycle loss and a skip-cycle loss, which exploit temporal continuity to robustly align visual features (one possible combination is sketched after this list).
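
Reusing the `cycle_loss` sketch above, one plausible way to combine cycles of several lengths with a skip cycle looks like this; the cycle lengths and the unweighted sum are our assumptions, not the paper's reported configuration:

```python
def total_loss(tracker, patch, start_pos, frames, cycle_lengths=(1, 2, 4)):
    """Sum cycle losses over several cycle lengths, plus a skip cycle
    that jumps directly between the most distant frame and the start.
    Assumes len(frames) > max(cycle_lengths).
    """
    loss = 0.0
    for k in cycle_lengths:
        # Long cycle over the last k intermediate frames.
        loss = loss + cycle_loss(tracker, patch, start_pos, frames[-(k + 1):])
    # Skip cycle: track straight to the farthest frame and back.
    loss = loss + cycle_loss(tracker, patch, start_pos, [frames[0], frames[-1]])
    return loss
```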

Experimental Evaluation

The authors evaluated the approach on multiple datasets; in each case, the frozen representation is used to propagate labels to new frames via nearest neighbors in feature space (sketched after the list):

  • Video Object Segmentation (DAVIS-2017): Competitive performance on instance mask propagation, significantly outperforming other self-supervised methods.
  • Keypoint Tracking (JHMDB): Keypoint propagation accuracy that closely matches supervised models pretrained on ImageNet.
  • Semantic and Instance Propagation (VIP): Strong mIoU and instance-level accuracy on longer-form videos.
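
The label-propagation step shared by these evaluations can be sketched as a soft nearest-neighbor lookup in the learned feature space; the function below is our illustrative version (the normalization, temperature, and shapes are assumptions), not the authors' exact test-time procedure:

```python
import torch.nn.functional as F

def propagate_labels(feat_src, labels_src, feat_tgt, temperature=0.07):
    """Propagate per-pixel labels from a source frame to a target frame
    by soft nearest neighbors between encoder features.

    feat_src, feat_tgt: (C, Hs, Ws) and (C, H, W) frozen encoder features.
    labels_src: (K, Hs, Ws) one-hot label maps (e.g. instance masks).
    Returns (K, H, W) propagated label scores for the target frame.
    """
    _, H, W = feat_tgt.shape
    src = F.normalize(feat_src.flatten(1), dim=0)        # (C, Hs*Ws)
    tgt = F.normalize(feat_tgt.flatten(1), dim=0)        # (C, H*W)
    aff = F.softmax(tgt.t() @ src / temperature, dim=1)  # (H*W, Hs*Ws)
    out = aff @ labels_src.flatten(1).t()                # (H*W, K)
    return out.t().reshape(-1, H, W)
```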

Implications and Future Directions

The paper's approach provides a robust framework for learning visual correspondences without manual annotations, paving the way for large-scale video understanding in unconstrained settings.

Potential future developments could involve enhancing the model to better handle occlusions and employing improved patch selection strategies during training. Moreover, extending the methodology to exploit additional modalities, such as audio, could offer enriched representations, benefiting broader areas of video analysis.

The framework holds promise for advancing unsupervised learning paradigms and could contribute significantly to developments in areas requiring temporal visual coherence, including augmented reality and autonomous vehicle perception systems. The results highlight the latent potential in leveraging vast amounts of unlabeled video data, which may increasingly become a staple in AI research and application.
