Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Published 28 Mar 2023 in cs.CV | (2303.16058v2)

Abstract: Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performances on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.

Summary

The paper introduces the UnMasked Teacher strategy to transfer semantic knowledge from IFMs to VFMs, enhancing training efficiency.
It employs a two-stage pre-training approach that first builds temporal understanding from video data and then diversifies with vision-language data.
The model achieves state-of-the-art performance on action recognition and video-text tasks, with a reported 70x reduction in carbon footprint compared with models such as CoCa.

Overview of "Unmasked Teacher: Towards Training-Efficient Video Foundation Models"

The paper "Unmasked Teacher: Towards Training-Efficient Video Foundation Models" addresses two obstacles to training Video Foundation Models (VFMs): high computational cost and the scarcity of video data compared with what is available for Image Foundation Models (IFMs). The authors propose UnMasked Teacher (UMT), a training-efficient method that combines the data efficiency of masked video modeling, as in VideoMAE, with semantic guidance from an IFM, without compromising performance.

Key Contributions and Methodology

The primary contribution of the paper is a training-efficient method for transferring semantic knowledge from IFMs to VFMs. It relies on an UnMasked Teacher strategy designed for temporally sensitive video understanding:

UnMasked Teacher (UMT) Strategy: Rather than adapting an IFM directly to video, which transfers poorly, the framework uses the IFM as a teacher. Most video tokens, those with low semantic value, are masked out, and the remaining unmasked tokens are selectively aligned with the IFM.
Progressive Pre-training Approach: Training proceeds in two stages. The first stage pre-trains on video data alone to establish strong temporal understanding; the second adds vision-language data so that the model can also handle video-language tasks.
Training Efficiency: The scratch-built ViT-L/16 is pre-trained on publicly available data only, in 6 days on 32 A100 GPUs. Within this budget it achieves state-of-the-art performance on video tasks while substantially reducing the carbon footprint compared with models such as CoCa.
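The masking-and-alignment idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the per-token semantic scores, the top-k selection rule, the 25% keep ratio, and the choice of mean squared error on L2-normalised features are all assumptions made here for illustration, and the function names `select_unmasked` and `alignment_loss` are hypothetical.

```python
import numpy as np


def select_unmasked(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return sorted indices of the highest-scoring tokens (the ones left unmasked).

    `scores` is a hypothetical per-token semantic score; how the paper
    derives it is not covered in this summary.
    """
    num_keep = max(1, int(round(scores.shape[0] * keep_ratio)))
    # argsort is ascending, so the last `num_keep` entries have the highest scores.
    keep = np.argsort(scores)[-num_keep:]
    return np.sort(keep)


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each token feature vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_loss(student_tokens: np.ndarray, teacher_tokens: np.ndarray) -> float:
    """Squared distance between normalised student and teacher tokens, averaged over tokens.

    Assumption: the actual loss used by the authors may differ.
    """
    diff = l2_normalize(student_tokens) - l2_normalize(teacher_tokens)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


rng = np.random.default_rng(0)
num_tokens, dim = 16, 8

# Stand-ins for the IFM (teacher) token features and semantic scores.
teacher_tokens = rng.normal(size=(num_tokens, dim))
scores = rng.random(num_tokens)

# Keep 25% of the tokens, i.e. mask out the 75% with the lowest scores.
keep = select_unmasked(scores, keep_ratio=0.25)

# Stand-in for the student: a real model would encode only the kept tokens.
student_tokens = teacher_tokens[keep] + 0.1 * rng.normal(size=(len(keep), dim))

# The loss is computed on the unmasked positions only.
loss = alignment_loss(student_tokens, teacher_tokens[keep])
print(f"kept {len(keep)} of {num_tokens} tokens, alignment loss = {loss:.4f}")
```

Because the student only ever processes the kept tokens, the cost per training step shrinks with the keep ratio, which is where the efficiency gain in this kind of scheme comes from.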

Empirical Validation

The authors validate the approach with experiments across a range of tasks:

Action Recognition: The approach attains significant improvements on benchmark datasets such as Kinetics, showing superior performance over previous models, particularly on complex, scene-related, and temporal tasks.
Spatiotemporal Localization and Video-Text Retrieval: Beyond conventional action recognition, the method performs strongly on spatiotemporal action localization and video-text retrieval, which supports its adaptability across task types.
Environmental Cost: The authors report a 70x reduction in carbon footprint compared with models that rely on much larger datasets and compute resources, such as CoCa.
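To put the stated pre-training budget in concrete terms, the short calculation below converts it to GPU-days and GPU-hours. It uses only the two figures given in the abstract (32 A100 GPUs, 6 days); it does not attempt to re-derive the 70x comparison, since the baseline's compute budget is not given here.

```python
# Figures stated in the paper's abstract.
NUM_GPUS = 32        # A100 GPUs used for pre-training
TRAINING_DAYS = 6    # wall-clock pre-training time in days

gpu_days = NUM_GPUS * TRAINING_DAYS
gpu_hours = gpu_days * 24

print(f"{gpu_days} A100 GPU-days = {gpu_hours} A100 GPU-hours")
# prints "192 A100 GPU-days = 4608 A100 GPU-hours"
```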

Implications and Future Developments

The work points toward efficient, scalable, and less carbon-intensive VFMs. By reducing data and compute requirements, it lowers the barrier to scaling up video foundation models, with potential applications in areas such as automated surveillance, multimedia retrieval, and entertainment.

Similar methods might be adapted to improve training efficiency in other domains of AI. The progressive pre-training framework could also be extended to additional data modalities, moving toward more comprehensive multimodal learning systems.

In summary, the paper introduces a practical approach to training video foundation models that tackles compute cost and data scarcity with concrete methodological advances.