Large-scale weakly-supervised pre-training for video action recognition

Published 2 May 2019 in cs.CV | (1905.00561v1)

Abstract: Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders the progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos for pre-training video models for the task of action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite on noisy social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos vs. short videos; since action labels are provided at a video level, how should one choose video clips for best performance, given some fixed budget of number or minutes of videos?


Authors (6)

Citations (292)


Summary

The paper demonstrates that using large-scale, noisy web videos significantly improves transfer learning performance, achieving state-of-the-art accuracy on Kinetics and EPIC-Kitchens.
It studies how to construct a verb-object pre-training label space, whether spatio-temporal pre-training adds value over frame-based pre-training, and how video length affects the temporal localization of actions.
Experiments reveal a log-linear improvement in accuracy with increased data volume, underscoring the potential of weak supervision for scalable video action recognition.

Large-scale Weakly-supervised Pre-training for Video Action Recognition

The paper "Large-scale weakly-supervised pre-training for video action recognition" investigates how well large volumes of web videos work for pre-training video models for action recognition. The authors use a dataset of over 65 million public user-generated videos from social media, whose hashtag labels are noisy and are provided only at the video level, so the labeled action is not temporally localized. They show that weak supervision at this scale substantially improves transfer learning performance on challenging action recognition datasets, including Kinetics, EPIC-Kitchens, and Something-Something.

Key Contributions

The study explores several pivotal questions related to constructing and utilizing weakly-supervised video action datasets:

Verb-object Label Space Construction: Actions generally involve interactions with objects. The paper examines how the construction of a verb-object pre-training label space affects transfer learning, comparing the marginal and joint distributions of verb and object labels.
Spatio-temporal Features: Frame-based models already perform quite well on action recognition, so the paper asks whether pre-training for good image features is sufficient or whether pre-training for spatio-temporal features is needed for the best transfer.
Temporal Localization: The team investigates the localization of actions within varying video lengths, addressing whether short or long videos are more beneficial for pre-training under constraints of video number or total duration.
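
To make the marginal-versus-joint distinction concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each video carries a list of hashtags and that small verb and noun vocabularies are available; the vocabularies, the video records, and the helper names `marginal_labels` and `joint_labels` are all hypothetical.

```python
from collections import Counter

# Hypothetical vocabularies and hashtags for illustration only;
# none of these values come from the paper.
VERBS = {"cut", "wash", "open"}
NOUNS = {"onion", "dish", "door"}

videos = [
    {"id": "v1", "hashtags": ["cut", "onion", "kitchen"]},
    {"id": "v2", "hashtags": ["wash", "dish"]},
    {"id": "v3", "hashtags": ["cut", "dish"]},
    {"id": "v4", "hashtags": ["door"]},
]


def marginal_labels(tags):
    """Marginal label space: verbs and nouns are independent labels."""
    return sorted(t for t in tags if t in VERBS or t in NOUNS)


def joint_labels(tags):
    """Joint label space: each co-occurring verb-noun pair is one label."""
    verbs = sorted(t for t in tags if t in VERBS)
    nouns = sorted(t for t in tags if t in NOUNS)
    return [f"{v}+{n}" for v in verbs for n in nouns]


# Count how many videos support each label under the two constructions.
marginal_counts = Counter(
    label for video in videos for label in marginal_labels(video["hashtags"])
)
joint_counts = Counter(
    label for video in videos for label in joint_labels(video["hashtags"])
)

print("marginal:", dict(marginal_counts))
print("joint:", dict(joint_counts))
```

In this toy setup the joint space has fewer videos per label than the marginal space, and a video tagged with only a noun (`v4`) contributes no joint label at all. That sparsity is one reason the choice between the two constructions is not obvious in advance.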

Empirical Findings

The experimental results underscore the notable benefits of weakly-supervised large-scale pre-training:

State-of-the-art Results: Pre-training on over 65 million videos yields a top-1 accuracy of 81.3% on Kinetics, 3.6% above the previous state of the art. On EPIC-Kitchens, the approach improves accuracy on the unseen test set by 4.6%.
Scale and Capacity: The performance progressively improves with the expansion of pre-training datasets, exhibiting what the authors describe as a log-linear relationship between data volume and model accuracy. Model capacity also plays a critical role, with increased depth yielding enhanced performance, although saturation occurs at higher capacities, suggesting potential data bottlenecks.
Pre-training Label Space: Experimentation reveals that target datasets benefit most when the pre-training labels have high overlap with target task labels. A diverse but skewed distribution in the pre-training label space, such as verb-noun combinations, did not necessarily translate to improved performance, emphasizing the nuanced balance needed in constructing pre-training datasets.
Temporal Dynamics: The best choice of video length depends on how the budget is defined. Given a fixed number of videos, longer videos are more beneficial because their added content diversity outweighs the better temporal localization of shorter clips. Given a fixed budget of total video minutes, shorter videos are preferable because their actions are better localized.
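
A log-linear relationship means that accuracy grows roughly linearly in the logarithm of the number of pre-training videos. The sketch below fits `accuracy = a + b * log10(num_videos)` by ordinary least squares. The data points are invented for illustration and are not results from the paper; `fit_log_linear` is a hypothetical helper.

```python
import math


def fit_log_linear(num_videos, accuracies):
    """Least-squares fit of accuracy = a + b * log10(num_videos)."""
    xs = [math.log10(n) for n in num_videos]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    # Slope is covariance(x, y) / variance(x); intercept follows from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up (dataset size, top-1 accuracy) pairs, NOT numbers from the paper.
sizes = [1_000_000, 10_000_000, 100_000_000]
acc = [70.0, 73.0, 76.0]

a, b = fit_log_linear(sizes, acc)
print(f"intercept a = {a:.2f}, slope b = {b:.2f} accuracy points per 10x data")
```

Under this model the slope `b` is the accuracy gained per tenfold increase in data, so each additional point of accuracy requires multiplicatively more videos.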

Practical and Theoretical Implications

The findings support using large-scale, noisily labeled web video to learn stronger feature representations, challenging the traditional reliance on manually curated datasets. Because hashtags stand in for costly manual annotation, the approach can scale to far more videos than fully-supervised datasets currently contain.

The label-space and temporal experiments also show that video pre-training raises design questions that static-image pre-training does not, such as how verbs and objects should be combined into labels and how video length trades off against action localization.

Future Directions

The paper opens avenues for deeper explorations into weak supervision in video learning, particularly in areas such as diverse and adaptable label space construction and optimizing the trade-offs between temporal precision and content diversity. These directions promise to refine approaches in video action recognition, potentially impacting applications like surveillance, content recommendation, and automated video editing.

Overall, the paper shows that weak supervision for video pre-training scales to tens of millions of noisily labeled videos, and it identifies the label-space and video-length choices that matter most for transfer.
