RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos (2312.06729v3)
Abstract: Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long-video setting during training. RGNet surpasses prior methods, achieving state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
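To make the unified retrieval-and-grounding idea from the abstract concrete, below is a minimal PyTorch-style sketch: clip-level scores retrieve candidate clips for a text query, a sparse top-k selection keeps only the most relevant clips, and frame-level scores ground the moment inside those clips, all from one shared encoder. The module and variable names (e.g., UnifiedRetrievalGrounding, top_k, frame_head) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Sketch (under stated assumptions) of joint clip retrieval and frame grounding
# with a shared encoder and sparse top-k clip selection.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedRetrievalGrounding(nn.Module):
    def __init__(self, dim=256, top_k=4):
        super().__init__()
        self.top_k = top_k
        # One shared encoder layer serves both clip retrieval and frame grounding.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.frame_head = nn.Linear(dim, 1)  # per-frame relevance logits

    def forward(self, frames, query):
        """
        frames: (num_clips, frames_per_clip, dim) frame features of one long video
        query:  (1, dim) pooled text-query feature
        """
        C, T, D = frames.shape
        # Fuse the query into every frame token, then encode each clip jointly.
        tokens = self.encoder(frames + query.unsqueeze(0))           # (C, T, D)
        clip_feat = tokens.mean(dim=1)                                # (C, D)

        # Retrieval: clip-level similarity to the query.
        clip_scores = F.cosine_similarity(clip_feat, query, dim=-1)  # (C,)

        # Sparsity: keep only the top-k highest-scoring clips.
        k = min(self.top_k, C)
        top_idx = clip_scores.topk(k).indices                        # (k,)

        # Grounding: per-frame scores inside the retained clips only.
        frame_scores = self.frame_head(tokens[top_idx]).squeeze(-1)  # (k, T)
        return clip_scores, top_idx, frame_scores


# Toy usage with random features standing in for video/text encoder outputs.
model = UnifiedRetrievalGrounding()
frames = torch.randn(16, 32, 256)   # 16 clips x 32 frames x 256-d features
query = torch.randn(1, 256)
clip_scores, top_idx, frame_scores = model(frames, query)
```

Because the clip scores and frame scores come from the same encoded tokens, a loss on either level also updates the other, which is the kind of mutual optimization the abstract attributes to the RG-Encoder.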
Authors: Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius