Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 134 tok/s
Gemini 2.5 Pro 41 tok/s Pro
GPT-5 Medium 39 tok/s Pro
GPT-5 High 27 tok/s Pro
GPT-4o 118 tok/s Pro
Kimi K2 181 tok/s Pro
GPT OSS 120B 429 tok/s Pro
Claude Sonnet 4.5 37 tok/s Pro
2000 character limit reached

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization (2107.12589v1)

Published 27 Jul 2021 in cs.CV

Abstract: Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision. Both appearance and motion features are used in previous works, while they do not utilize them in a proper way but apply simple concatenation or score-level fusion. In this work, we argue that the features extracted from the pretrained extractor, e.g., I3D, are not the WS-TALtask-specific features, thus the feature re-calibration is needed for reducing the task-irrelevant information redundancy. Therefore, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical proposed cross-modal consensus modules (CCM) that design a cross-modal attention mechanism to filter out the task-irrelevant information redundancy using the global information from the main modality and the cross-modal local information of the auxiliary modality. Moreover, we treat the attention weights derived from each CCMas the pseudo targets of the attention weights derived from another CCM to maintain the consistency between the predictions derived from two CCMs, forming a mutual learning manner. Finally, we conduct extensive experiments on two common used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method and achieve the state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.

Citations (76)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.