Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning (2212.04500v2)

Published 8 Dec 2022 in cs.CV

Abstract: Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning representations from scratch through reconstructing low-level features like raw pixel RGB values. In this paper, we propose masked video distillation (MVD), a simple yet effective two-stage masked feature modeling framework for video representation learning: firstly we pretrain an image (or video) model by recovering low-level features of masked patches, then we use the resulting features as targets for masked feature modeling. For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks. Visualization analysis also indicates different teachers produce different learned patterns for students. Motivated by this observation, we design a spatial-temporal co-teaching method for MVD. Specifically, we distill student models from both video teachers and image teachers by masked feature modeling. Extensive experimental results demonstrate that video transformers pretrained with spatial-temporal co-teaching outperform models distilled with a single teacher on a multitude of video datasets. Our MVD with vanilla ViT achieves state-of-the-art performance compared with previous supervised or self-supervised methods on several challenging video downstream tasks. For example, with the ViT-Large model, our MVD achieves 86.4% and 76.7% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 2.4% respectively. When a larger ViT-Huge model is adopted, MVD achieves the state-of-the-art performance with 77.3% Top-1 accuracy on Something-Something-v2 and 41.1 mAP on AVA v2.2. Code will be available at https://github.com/ruiwang2021/mvd.

Summary

The paper presents a novel Masked Video Distillation (MVD) framework that uses a two-stage strategy to distill high-level features from both image and video teacher models.
It achieves notable improvements in video classification accuracy, with ViT-Large reaching 86.4% on Kinetics-400 and 76.7% on Something-Something-v2.
The method offers scalable insights for advancing self-supervised video transformers by effectively integrating spatial and temporal feature modeling.

An Examination of Masked Video Distillation for Self-supervised Video Representation Learning

The paper "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning" introduces Masked Video Distillation (MVD), a framework for self-supervised video representation learning. MVD uses a two-stage masked feature modeling strategy in which a student video transformer learns from the features of pretrained image and video teacher models.

Technical Approach

MVD separates itself from conventional self-supervised techniques that predominantly focus on low-level feature reconstruction, such as pixel values or VQVAE tokens, which often suffer from redundancy and noise. Instead, this research leverages high-level features extracted from pretrained models, known as teacher models, as the target for video representation learning. The method encompasses the following stages:

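Pretraining of Teacher Models: In this stage, an image or video model is pretrained by masked modeling, recovering low-level features (such as raw pixel values) of masked patches. The resulting model serves as a teacher, and its output features become the high-level targets for the second stage.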
Masked Feature Distillation: In this stage, student models are trained by masked feature modeling, with the teacher features serving as the targets to be predicted for masked patches. The paper highlights two types of teachers: image teachers, which transfer stronger spatial representations, and video teachers, which better capture temporal dynamics. MVD combines the advantages of each through a spatial-temporal co-teaching strategy, in which a single student is distilled from both an image teacher and a video teacher.
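
The co-teaching objective described above can be sketched as a sum of two masked feature losses, one per teacher. This is a minimal NumPy illustration, not the authors' implementation: the function names, the use of mean squared error as the distance, and the weights `w_img` and `w_vid` are all assumptions made for the example, and the arrays stand in for features that real encoders and decoders would produce.

```python
import numpy as np

def masked_feature_loss(pred, target, mask):
    """Mean squared error computed over masked tokens only.

    pred, target: arrays of shape (num_tokens, dim)
    mask: boolean array of shape (num_tokens,), True where a token was masked
    NOTE: MSE is an assumption here; the summary does not name the distance.
    """
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))

def co_teaching_loss(pred_img, pred_vid, feat_img, feat_vid, mask,
                     w_img=1.0, w_vid=1.0):
    """Weighted sum of the image-teacher and video-teacher losses.

    pred_img / pred_vid: student predictions of each teacher's features
    feat_img / feat_vid: target features from the image / video teacher
    w_img, w_vid: hypothetical loss weights, not taken from the paper
    """
    loss_img = masked_feature_loss(pred_img, feat_img, mask)
    loss_vid = masked_feature_loss(pred_vid, feat_vid, mask)
    return w_img * loss_img + w_vid * loss_vid

# Toy usage with random stand-ins for teacher features and student outputs.
rng = np.random.default_rng(0)
num_tokens, dim = 8, 4
mask = np.zeros(num_tokens, dtype=bool)
mask[rng.permutation(num_tokens)[:6]] = True  # mask 6 of the 8 tokens

feat_img = rng.normal(size=(num_tokens, dim))  # image-teacher targets
feat_vid = rng.normal(size=(num_tokens, dim))  # video-teacher targets
pred_img = rng.normal(size=(num_tokens, dim))  # student prediction, image branch
pred_vid = rng.normal(size=(num_tokens, dim))  # student prediction, video branch

print(co_teaching_loss(pred_img, pred_vid, feat_img, feat_vid, mask))
```

Only masked positions contribute to the loss, so the student is scored on predicting teacher features for patches it never saw.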

Key Findings and Results

The empirical findings demonstrate that students distilled from video teachers tend to perform better on tasks emphasizing temporal aspects, whereas those distilled from image teachers perform robustly on spatially-focused tasks. This relationship is quantified by observing cross-frame feature similarities, showing that video teachers capture more temporal dynamics as reflected in feature patterns across different frames.
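
One way to compute a cross-frame feature similarity of the kind mentioned above is to average the per-patch cosine similarity between every pair of frames. The sketch below assumes features arranged as `(frames, patches, dim)`; the function name and this exact protocol are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def cross_frame_similarity(features):
    """Mean per-patch cosine similarity between every pair of frames.

    features: array of shape (T, N, D) with T frames, N patches per frame,
              and D feature channels.
    Returns a (T, T) matrix whose entry [t, s] averages, over patches n,
    the cosine similarity between features[t, n] and features[s, n].
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)  # guard against zero vectors
    # Sum dot products over patches and channels, then average over patches.
    return np.einsum("tnd,snd->ts", unit, unit) / features.shape[1]

# Toy usage: random features stand in for a model's per-frame outputs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16, 32))
sim = cross_frame_similarity(feats)
print(sim.shape)
```

Under this measure, off-diagonal values close to 1 mean the features barely change from frame to frame, while lower values indicate features that vary more over time.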

The paper reports improved classification accuracy over the VideoMAE baseline on Kinetics-400 and Something-Something-v2. With the ViT-Large model, MVD achieves 86.4% and 76.7% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 2.4%, respectively. With the larger ViT-Huge model, MVD reaches 77.3% Top-1 accuracy on Something-Something-v2 and 41.1 mAP on AVA v2.2.

Implications and Future Prospects

The work has practical implications for the development of self-supervised video transformers. By offering a way to use the strengths of both spatial and temporal teachers simultaneously, MVD presents a scalable path for improving video representation quality. Theoretically, the paper contributes to understanding how the spatial and temporal aspects of video data are shaped by the choice of teacher in self-supervised learning. It may also open avenues for exploring feature distillation in multimodal learning.

In the future, research can expand on this approach by involving even larger datasets and teacher models pretrained on diverse datasets to further enhance the adaptability of student models to varied video tasks. Additionally, exploring methods to distill features without fixed pretrained models could present a streamlined and flexible solution for self-supervised learning in video understanding.

In summary, this paper investigates the use of high-level feature targets for self-supervised learning of video representations, and shows that combining image and video teachers yields students that outperform those distilled from a single teacher.

GitHub

GitHub - ruiwang2021/mvd: [CVPR2023] Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning (https://arxiv.org/abs/2212.04500) (101 stars)