UntrimmedNets for Weakly Supervised Action Recognition and Detection

Published 9 Mar 2017 in cs.CV | (1703.03329v2)

Abstract: Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet only employs weak supervision, our method achieves performance superior or comparable to that of those strongly supervised approaches on these two datasets.

Summary

The paper introduces UntrimmedNet, which leverages a dual-module design to learn action models and temporal extents in untrimmed videos.
It employs both uniform and shot-based sampling with Two-Stream CNNs or Temporal Segment Networks for efficient feature extraction.
Experimental results demonstrate competitive performance on THUMOS14 and ActivityNet while training only on video-level labels, which significantly reduces annotation cost.

UntrimmedNets for Weakly Supervised Action Recognition and Detection

The paper, "UntrimmedNets for Weakly Supervised Action Recognition and Detection," authored by Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool, introduces an innovative approach to the task of action recognition and detection within untrimmed video sequences. Traditional methods rely heavily on trimmed video datasets, which are both expensive and impractical to annotate at scale. This paper attempts to circumvent these limitations by proposing a weakly supervised architecture, termed UntrimmedNet, which is capable of learning directly from untrimmed videos without individual temporal annotations of action instances.

Architecture and Methodology

The UntrimmedNet framework couples two core components: a classification module and a selection module. The classification module is responsible for learning action models, while the selection module is dedicated to understanding the temporal extent of action instances within video sequences. These components are realized through feed-forward networks, allowing the entire architecture to be trained end-to-end.

UntrimmedNet begins with a process of generating clip proposals from untrimmed videos. Two sampling methods are evaluated: uniform sampling and shot-based sampling. The latter method leverages shot boundary detection to propose clips, potentially improving proposal quality by preserving temporal coherence.
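The two proposal strategies can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of summed absolute feature differences as the shot-change signal, and the `threshold` and `clip_len` parameters are all assumptions made for illustration.

```python
import numpy as np

def uniform_proposals(num_frames, num_clips):
    """Split [0, num_frames) into num_clips contiguous, near-equal clips."""
    edges = np.linspace(0, num_frames, num_clips + 1).astype(int)
    # Drop empty clips, which can occur when num_clips > num_frames.
    return [(int(s), int(e)) for s, e in zip(edges[:-1], edges[1:]) if e > s]

def shot_proposals(frame_features, threshold, clip_len):
    """Cut the video where consecutive frames differ strongly, then tile
    each shot with clips of at most clip_len frames (hypothetical scheme)."""
    # Change signal: summed absolute difference between adjacent frame features.
    diffs = np.abs(np.diff(frame_features, axis=0)).sum(axis=1)
    # A cut after frame i means a new shot starts at frame i + 1.
    cuts = (np.flatnonzero(diffs > threshold) + 1).tolist()
    bounds = [0] + cuts + [len(frame_features)]
    proposals = []
    for shot_start, shot_end in zip(bounds[:-1], bounds[1:]):
        for start in range(shot_start, shot_end, clip_len):
            # Clips never cross a shot boundary.
            proposals.append((start, min(start + clip_len, shot_end)))
    return proposals
```

Both functions return half-open `(start, end)` frame ranges. Because shot-based clips are clipped at shot boundaries, each proposal stays within a single shot, which is the temporal-coherence property described above.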

Once clip proposals are generated, feature extraction is performed using either Two-Stream CNNs or Temporal Segment Networks. The classification module then predicts scores for each clip, while the selection module identifies and ranks clip proposals, utilizing techniques such as hard selection (top-k pooling) and soft selection (attention weights).
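The difference between hard and soft selection can be sketched as two ways of pooling per-clip class scores into one video-level score. This is an illustrative sketch only: the array shapes, the function names, and the choice to pool raw scores rather than probabilities are assumptions, not details confirmed by this summary.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hard_selection(clip_scores, k):
    """Top-k pooling: for each class, average its k highest clip scores.

    clip_scores has shape (num_clips, num_classes)."""
    top_k = np.sort(clip_scores, axis=0)[-k:]  # k largest per class
    return top_k.mean(axis=0)

def soft_selection(clip_scores, attention_logits):
    """Attention pooling: weight each clip by a softmax over the clip axis.

    attention_logits has shape (num_clips,), one scalar per clip."""
    weights = softmax(attention_logits, axis=0)
    return weights @ clip_scores  # shape (num_classes,)
```

In either case the pooled vector can be passed through a softmax over classes and trained with a cross-entropy loss against the video-level label, so no per-clip labels are needed. With all attention logits equal, soft selection reduces to a plain average over clips.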

Experimental Results

The authors conduct extensive experiments on the THUMOS14 and ActivityNet datasets. These datasets contain challenging untrimmed videos, making them ideal for evaluating the efficacy of UntrimmedNet. Both weakly supervised action recognition and detection tasks are considered.

Action Recognition: UntrimmedNet performs on par with or better than existing strongly supervised methods. In particular, the Temporal Segment Network with soft selection achieved 74.2% accuracy on THUMOS14 and 86.9% on the validation set of ActivityNet.
Action Detection: Although the system requires only video-level labels during training, it achieves results comparable to those of methods trained with temporal annotations, which supports the practical viability of the proposed approach.

Theoretical and Practical Implications

This work has implications for both the theory and the practice of action recognition. From a theoretical perspective, UntrimmedNet shows that classification and selection can be learned jointly in a single network, with video-level labels as the only supervision signal for both.

Practically, the reduction in annotation cost and complexity makes it feasible to scale action recognition systems to the large collections of untrimmed video found on platforms like YouTube. The architecture's ability to perform recognition and detection without exhaustive temporal annotations is particularly advantageous where such annotations are unavailable or too costly to collect.

Future Directions

The paper's contributions could be extended by exploring alternative models for the classification and selection modules, or by integrating more sophisticated attention mechanisms to further improve detection precision. Additionally, the application of UntrimmedNet to other domains requiring temporal reasoning, such as multi-agent interaction in videos, could be a valuable avenue for future research.

Overall, UntrimmedNet provides a compelling approach to address the limitations of traditional action recognition methods, demonstrating robust performance with minimal supervision. This paper represents a substantial step forward in the development of scalable video analysis systems.
