RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos (2312.06729v3)

Published 11 Dec 2023 in cs.CV

Abstract: Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing grounding methods, designed for short videos (5-30 seconds), to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two disjoint stages: clip retrieval and grounding. This disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long-video paradigm during training. RGNet surpasses prior methods, achieving state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
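The paper's code and equations do not appear on this page, but the abstract's two central ideas can be illustrated with a short, hedged sketch. The PyTorch snippet below is an assumption-based illustration, not RGNet's actual implementation: a single encoder feeds both a clip-level retrieval head and a frame-level grounding head (shared features, mutual optimization), a simple relevance-gated threshold stands in for the paper's sparse attention mechanism, and a toy sampler stands in for contrastive clip sampling. All names and hyperparameters (RGEncoderSketch, tau, clip_len, n_neg) are hypothetical.

```python
# Minimal sketch of a unified retrieval-and-grounding encoder, assuming
# PyTorch. Everything here is illustrative, not the authors' code.
import torch
import torch.nn as nn


class RGEncoderSketch(nn.Module):
    """Toy analogue of an encoder that scores clips (retrieval) and
    frames (grounding) from the same cross-attended features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.clip_head = nn.Linear(dim, 1)   # clip-level retrieval score
        self.frame_head = nn.Linear(dim, 1)  # frame-level grounding score

    def forward(self, frames: torch.Tensor, query: torch.Tensor,
                tau: float = 0.5):
        # frames: (B, T, D) visual features; query: (B, L, D) text features.
        attended, _ = self.cross_attn(frames, query, query)

        # Stand-in for sparse attention: suppress frames whose clip-level
        # relevance does not clear the threshold before frame scoring.
        clip_scores = self.clip_head(attended).squeeze(-1)         # (B, T)
        keep = (clip_scores.sigmoid() > tau).float().unsqueeze(-1)
        frame_scores = self.frame_head(attended * keep).squeeze(-1)
        return clip_scores, frame_scores


def sample_contrastive_clips(num_frames: int, gt_start: int, gt_end: int,
                             clip_len: int, n_neg: int = 4):
    """Hypothetical sampler: one positive clip covering the annotated
    moment plus negatives from elsewhere in the same long video, loosely
    mimicking test-time retrieval. Assumes num_frames > clip_len."""
    pos_start = max(0, min(gt_start, num_frames - clip_len))
    positive = (pos_start, pos_start + clip_len)
    negatives = []
    for s in torch.randint(0, num_frames - clip_len, (8 * n_neg,)).tolist():
        # Keep only candidates that do not overlap the annotated moment.
        if (s + clip_len <= gt_start or s >= gt_end) and len(negatives) < n_neg:
            negatives.append((s, s + clip_len))
    return positive, negatives
```

Running the two heads off one shared encoder is what makes the stages mutually optimizable in this sketch: gradients from the frame-level grounding loss flow back through the same features the retrieval head scores, which is the coupling the abstract says disjoint two-stage pipelines lack.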

Authors (4)
  1. Tanveer Hannan (9 papers)
  2. Md Mohaiminul Islam (13 papers)
  3. Thomas Seidl (25 papers)
  4. Gedas Bertasius (55 papers)
