- The paper introduces temporally adversarial learning to force the encoder to generate robust features despite dropped video frames.
- It implements a temporal decay mechanism that down-weights older keys, refining contrastive loss for current video inputs.
- VideoMoCo outperforms state-of-the-art methods on datasets like UCF101 and HMDB51, indicating strong potential for advanced video analytics.
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples
The paper presents a significant advancement in unsupervised video representation learning by introducing a novel framework called VideoMoCo. This work builds on Momentum Contrast (MoCo), which was originally developed for unsupervised image representation learning. The authors propose two key innovations to adapt MoCo to video: temporally adversarial learning and temporal decay. Together, these enhancements improve the temporal robustness and overall quality of the features MoCo learns from video data.
Key Contributions
- Temporally Adversarial Learning: The primary innovation in VideoMoCo is a generator–discriminator setup built around the MoCo encoder. The generator learns to drop the frames whose removal hurts the feature most, while the discriminator (the MoCo encoder) is trained to produce similar representations for the full and frame-dropped versions of a video. This adversarial game improves the temporal robustness of the encoder without relying on hand-designed pretext tasks.
- Temporal Decay Mechanism: VideoMoCo also addresses the staleness of keys stored in the memory queue: because the key encoder keeps updating, older keys become less representative of the current feature space. The temporal decay mechanism attenuates each key's contribution to the contrastive loss according to how long it has been in the queue, so the loss computation aligns more closely with the current state of the encoder and input samples.
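To make the frame-dropping idea concrete, here is a minimal sketch in plain Python. It is not the paper's learned generator: the "encoder" is stand-in mean pooling over per-frame feature vectors, and the "generator" is a greedy heuristic that drops the frames whose removal most perturbs the pooled clip feature. The function names (`mean_pool`, `adversarial_drop`) are illustrative, not from the paper.

```python
import math

def mean_pool(frames):
    # Encoder stand-in: average per-frame feature vectors into one clip feature.
    n, dim = len(frames), len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def l2(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adversarial_drop(frames, k):
    """Greedily drop the k frames whose removal moves the pooled clip
    feature the most -- a heuristic stand-in for the learned generator,
    which picks frames adversarially to challenge the encoder."""
    kept = list(frames)
    for _ in range(k):
        full = mean_pool(kept)
        # Score each frame by how far the pooled feature shifts without it.
        scores = [l2(mean_pool(kept[:i] + kept[i + 1:]), full)
                  for i in range(len(kept))]
        kept.pop(scores.index(max(scores)))
    return kept
```

In VideoMoCo the encoder is then trained so that its feature for the frame-dropped clip still matches the key of the intact clip, which is what forces temporal robustness.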
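The temporal decay idea can likewise be sketched as a weighted InfoNCE loss, assuming (as a simplification of the paper's formulation) that the i-th oldest negative key in the queue is down-weighted by a factor t**i with t in (0, 1). The vectors here are plain Python lists and the queue ordering is an assumption of this sketch.

```python
import math

def decayed_info_nce(q, pos_key, queue, tau=0.07, t=0.99999):
    """Contrastive (InfoNCE-style) loss in which the i-th oldest negative
    key in the memory queue contributes with weight t**i, so stale keys
    count less. q, pos_key: feature vectors; queue: negatives, newest first."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    pos = math.exp(dot(q, pos_key) / tau)
    # Older keys (larger i) receive exponentially smaller weight t**i.
    neg = sum((t ** i) * math.exp(dot(q, k) / tau)
              for i, k in enumerate(queue))
    return -math.log(pos / (pos + neg))
```

With t = 1 this reduces to the standard MoCo loss; any t < 1 shrinks the influence of older negatives, which is the intended refinement.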
Experimental Validation
Experiments demonstrate the efficacy of VideoMoCo on benchmark datasets such as UCF101 and HMDB51. VideoMoCo's performance surpasses current state-of-the-art video representation learning techniques, highlighting its superior capability in capturing temporal dynamics in video data. The encoder trained via this framework exhibits enhanced robustness, which could lead to improved downstream tasks in diverse video recognition applications.
Theoretical and Practical Implications
The introduction of temporally adversarial learning and temporal decay presents significant enhancements to contrastive learning in video contexts. The adversarial generator acts as a learned temporal augmentation, exposing the encoder to frame-dropout perturbations rather than fixed hand-crafted transforms, while temporal decay keeps the contrastive loss anchored to keys that still reflect the current encoder.
Furthermore, by eliminating the need for manually designed pretext tasks, VideoMoCo offers a task-agnostic framework that could streamline the development workflows for unsupervised learning models. Future work could explore the integration of VideoMoCo with more complex architectures and an expanded range of video-centric tasks, potentially improving the generalizability and efficiency of unsupervised video representations.
In conclusion, VideoMoCo represents an important step forward in the adaptation of contrastive learning from static image domains to dynamic video environments. Its innovative methodologies could influence future research directions in unsupervised video learning and beyond, potentially impacting video analytics, surveillance, and multimedia applications.