Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding (2401.00901v2)
Abstract: Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches, which struggle in open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundation spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in both closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model that surpasses state-of-the-art results in closed-set evaluations on multiple datasets and demonstrates superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on the VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our code will be publicly released.
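The m_vIoU figure quoted above is the standard evaluation metric in the spatio-temporal video grounding literature (e.g., VidSTG and HC-STVG): the mean, over test samples, of the volumetric IoU between the predicted and ground-truth tubes. Below is a minimal sketch of how it is computed, assuming predictions and ground truth are given as per-frame bounding boxes keyed by frame index; all function names are illustrative, not the authors' code.

```python
def box_iou(box_a, box_b):
    """Spatial IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred, gt):
    """vIoU for one sample; `pred` and `gt` map frame index -> box.

    S_i: frames in the temporal intersection of prediction and ground truth.
    S_u: frames in their temporal union.
    vIoU = (1 / |S_u|) * sum of spatial IoU over S_i, so both temporal
    misalignment and loose boxes lower the score.
    """
    s_i = pred.keys() & gt.keys()
    s_u = pred.keys() | gt.keys()
    if not s_u:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in s_i) / len(s_u)


def m_viou(samples):
    """m_vIoU: mean vIoU over all (prediction, ground-truth) test pairs."""
    return sum(viou(p, g) for p, g in samples) / len(samples)


# Toy usage: prediction overlaps ground truth on frames 1-2 out of a
# 4-frame temporal union, with one loosely fitting box on frame 2.
pred = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10), 3: (0, 0, 10, 10)}
gt = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10), 2: (5, 0, 15, 10)}
print(m_viou([(pred, gt)]))  # (1.0 + 1/3) / 4 ~= 0.333
```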