BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval (2110.15609v3)

Published 29 Oct 2021 in cs.CV and cs.IR

Abstract: The task of text-video retrieval, which aims to understand the correspondence between language and vision, has gained increasing attention in recent years. Previous studies either adopt off-the-shelf 2D/3D-CNNs and use average/max pooling to capture spatial features with aggregated temporal information as global video embeddings, or introduce graph-based models and expert knowledge to learn local spatial-temporal relations. However, existing methods have two limitations: 1) global video representations learn temporal information through simple average/max pooling and do not fully explore the temporal relations between every pair of frames; 2) graph-based local video representations are handcrafted and depend heavily on expert knowledge and empirical feedback, so they may fail to effectively mine higher-level fine-grained visual relations. These limitations leave such methods unable to distinguish videos that share the same visual components but differ in their relations. To solve this problem, we propose a novel cross-modal retrieval framework, the Bi-Branch Complementary Network (BiC-Net), which modifies the transformer architecture to effectively bridge the text and video modalities in a complementary manner by combining local spatio-temporal relations with global temporal information. Specifically, local video representations are encoded using multiple transformer blocks and additional residual blocks to learn spatio-temporal relation features; we call this module the Spatio-Temporal Residual Transformer (SRT). Meanwhile, global video representations are encoded using a multi-layer transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval.
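
The abstract describes a bi-branch design: a local SRT branch (transformer blocks plus residual blocks over region-level features), a global branch (multi-layer transformer over frame-level features), and alignment with text in two separate embedding spaces. The sketch below illustrates that structure only at a high level; the module names, feature dimensions, pooling choices, and the use of `nn.TransformerEncoder` are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the bi-branch idea from the abstract (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSRTBranch(nn.Module):
    """Local branch: transformer blocks plus an extra residual block over
    flattened region tokens, approximating the SRT module described above."""

    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, region_feats):          # (B, T*R, dim): region tokens from T frames
        x = self.encoder(region_feats)
        x = x + self.residual(x)              # additional residual block on top of the transformer
        return x.mean(dim=1)                  # pooled spatio-temporal relation feature


class GlobalTemporalBranch(nn.Module):
    """Global branch: multi-layer transformer over per-frame global features."""

    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):           # (B, T, dim): one global feature per frame
        return self.encoder(frame_feats).mean(dim=1)


class BiCNetSketch(nn.Module):
    """Aligns the text feature with the two video embeddings in two separate
    spaces and sums the resulting cosine similarities (assumed fusion rule)."""

    def __init__(self, dim=512, text_dim=768):
        super().__init__()
        self.local_branch = LocalSRTBranch(dim)
        self.global_branch = GlobalTemporalBranch(dim)
        self.text_to_local = nn.Linear(text_dim, dim)
        self.text_to_global = nn.Linear(text_dim, dim)

    def forward(self, region_feats, frame_feats, text_feat):
        v_local = F.normalize(self.local_branch(region_feats), dim=-1)
        v_global = F.normalize(self.global_branch(frame_feats), dim=-1)
        t_local = F.normalize(self.text_to_local(text_feat), dim=-1)
        t_global = F.normalize(self.text_to_global(text_feat), dim=-1)
        # Complementary score: similarity in the local space plus the global space.
        return (v_local * t_local).sum(-1) + (v_global * t_global).sum(-1)


if __name__ == "__main__":
    model = BiCNetSketch()
    regions = torch.randn(2, 8 * 4, 512)   # 8 frames x 4 regions per frame (assumed shapes)
    frames = torch.randn(2, 8, 512)
    text = torch.randn(2, 768)
    print(model(regions, frames, text).shape)   # torch.Size([2])
```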

Citations (6)
