Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos (2003.07048v1)

Published 16 Mar 2020 in cs.CV

Abstract: The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query. Most existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. In this work, we present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage. The proposed method leverages the idea of attentional reconstruction and directly scores the candidate segments with the learnt proposal-level attentions. Moreover, another branch that learns clip-level attention is exploited to refine the proposals at both the training and testing stages. We develop a novel proposal sampling mechanism to leverage intra-proposal information for learning better proposal representations, and adopt 2D convolution to exploit inter-proposal clues for learning a reliable attention map. Experiments on the Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our MARN over the existing weakly-supervised methods.
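
To make the idea of scoring candidate segments with proposal-level attention concrete, here is a minimal toy sketch in plain Python. It is not the MARN implementation: the function names (`enumerate_proposals`, `mean_pool`, `score_proposals`, `ground`), the mean pooling of clip features, and the dot-product scoring are all assumptions chosen for illustration. The paper's learned encoders, proposal sampling, 2D convolution, clip-level branch, and reconstruction objective are all omitted.

```python
import math

def enumerate_proposals(num_clips):
    """List every candidate segment as (start, end), inclusive clip indices."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]

def mean_pool(clip_feats, start, end):
    """Average the clip feature vectors over the inclusive span [start, end].

    Assumption: mean pooling stands in for a learned proposal representation.
    """
    span = clip_feats[start:end + 1]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [x / total for x in exps]

def score_proposals(clip_feats, query_vec):
    """Return all proposals with an attention weight per proposal.

    Assumption: the attention logit is a plain dot product between the
    pooled proposal feature and the query vector (hypothetical choice).
    """
    proposals = enumerate_proposals(len(clip_feats))
    logits = [
        sum(p * q for p, q in zip(mean_pool(clip_feats, s, e), query_vec))
        for s, e in proposals
    ]
    return proposals, softmax(logits)

def ground(clip_feats, query_vec):
    """Pick the proposal with the highest attention weight."""
    proposals, attn = score_proposals(clip_feats, query_vec)
    best = max(range(len(proposals)), key=lambda i: attn[i])
    return proposals[best], attn[best]

# Toy input: four clips with 2-D features; the query matches clip 2 best.
clip_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [1.0, 0.0]]
query_vec = [0.0, 1.0]
segment, weight = ground(clip_feats, query_vec)
```

In this toy setup the attention weights are used directly as segment scores at inference time, which mirrors the abstract's description only at a high level. In a weakly-supervised setting those weights would have to be learned from video-sentence pairs alone, for example through a reconstruction objective, rather than computed from fixed features as done here.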

Citations (59)
