What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

Published 15 Jul 2021 in cs.CV and cs.AI | (2107.07154v2)

Abstract: Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, there have been many constraints in the video domain due to challenging dynamics of spatio-temporal interactions (e.g., between which objects are there an interaction? when do relations start and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out limitations of these methods and propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look: it sparsifies relation search space by scoring relationness of object pair, i.e., measuring how probable a relation exist. TSPN tells when to look: it simultaneously predicts start-end timestamps (i.e., temporal spans) and categories of the all possible relations by utilizing full video context. These two designs enable a win-win scenario: it accelerates training by 2X or more than existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVDR and VidOR). Moreover, comprehensive ablative experiments demonstrate the effectiveness of our approach. Codes are available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.

Abstract PDF Upgrade to Chat

Authors (3)

Citations (2)

View on Semantic Scholar

Summary

The paper introduces TSPN, a network that predicts both relation likelihood and temporal spans for efficient video relation detection.
It combines deep learning-based detection and tracking with multi-modal feature fusion to streamline video relation modeling.
Evaluation on ImageNet-VidVRD and VidOR benchmarks demonstrates improved efficiency and competitive accuracy over existing methods.

Temporal Span Proposal Network for Video Relation Detection: An Academic Overview

The paper "What and When to Look?: Temporal Span Proposal Network for Video Relation Detection" proposes a novel approach to address the challenges in Video Visual Relation Detection (VidVRD). Unlike existing methods that struggle with efficiently modeling spatio-temporal dynamics in videos, this work introduces the Temporal Span Proposal Network (TSPN) to predict relations by determining both the likelihood of relations between objects (relationness) and when these relations occur (temporal span). This paper makes a significant contribution to improving both the effectiveness and efficiency of VidVRD tasks.

Key Contributions

Introduction of TSPN: The Temporal Span Proposal Network is designed to optimize video relationship detection by concurrently leveraging both temporal locality and globality of video content. TSPN identifies relations by calculating relationness scores, which helps in reducing the search space for potential interactions between object trajectories in the video.
Efficiency in Relation Detection: The TSPN method is computationally efficient compared to segment-based and window-based approaches. The proposed architecture theoretically doubles the detection efficiency and achieves this by ensuring each object pair is evaluated only once for potential relations across the video context.
Evaluation on Diverse Datasets: The efficacy of TSPN is validated on two benchmarks for VidVRD, namely ImageNet-VidVRD and VidOR. The method outperforms or matches the state-of-the-art approaches despite using less architecturally complicated means or external resources like language embeddings.
Insightful Feature Fusion: TSPN integrates multi-modal information, including visual, geometric, and semantic features, enhancing the model's capability to accurately predict complex and dynamic relations in videos. This approach underlines the importance of complementary feature use in relation detection tasks.

Methodological Insights

Object Trajectory Proposal: Utilizing deep learning-based detection and tracking methods, the proposed system efficiently segments object trajectories in video frames. Faster R-CNN and DeepSORT serve as the foundational components for preliminary object detection and tracking.
Relationness Scoring Module: By considering pairs of object trajectories, TSPN computes relationness scores to focus on pairs more likely to have interactions, significantly narrowing down the relational search space.
Temporal Span Prediction: Instead of relying exclusively on temporal slicing or predefined segment lengths as in prior methods, TSPN discretizes the temporal dimension to sectors, predicting relationship existence across these broader contexts efficiently.

Implications and Future Directions

The research presents substantial improvements in the computational efficiency of VidVRD, which is crucial for scaling to longer and more complex video datasets. In practical applications, such techniques will improve automated scene understanding in diverse fields, including surveillance, content moderation, and advanced video analytics.

From a theoretical perspective, TSPN's approach provides a foundational methodology that can be expanded upon for other dynamic vision tasks requiring contextual relational understanding. Future research might explore integrating real-time decision making or enhancing robustness across more varied datasets and spatio-temporal complexities without relying on traditionally intensive computations.

In summary, this paper advances our understanding of video-based relationship detection through the efficient Temporal Span Proposal Network. Its approach bridges the gap between efficiency and efficacy, laying the groundwork for further innovations in dynamic vision understanding.

Markdown Report Issue