VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT (2403.02076v1)
Abstract: Video temporal grounding (VTG) aims to locate specific temporal segments in an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce prejudice in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise a proposal generator and post-processing to produce accurate segments from the debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves performance comparable to supervised methods. The code is available at https://github.com/YoucanBaby/VTG-GPT
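To make the pipeline concrete, the sketch below illustrates only the final stage described in the abstract: turning per-frame relevance scores into temporal segments. It assumes that MiniGPT-v2 captions have already been compared against the debiased queries to yield one similarity score per sampled frame (e.g., via Sentence-BERT embeddings). The function names, the fixed threshold, and the gap-based merging rule are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch of proposal generation + post-processing, assuming
# per-frame caption/query similarities are precomputed. All names and
# thresholds here are hypothetical, not the authors' implementation.
from typing import List, Tuple

Proposal = Tuple[float, float, float]  # (start_sec, end_sec, score)

def generate_proposals(scores: List[float], fps: float = 1.0,
                       threshold: float = 0.5) -> List[Proposal]:
    """Group consecutive above-threshold frames into scored segments."""
    proposals, start = [], None
    for i, s in enumerate(scores + [float("-inf")]):  # sentinel flushes the final run
        if s >= threshold and start is None:
            start = i                                  # open a new segment
        elif s < threshold and start is not None:
            seg = scores[start:i]
            proposals.append((start / fps, i / fps, sum(seg) / len(seg)))
            start = None                               # close the segment
    return proposals

def post_process(proposals: List[Proposal], max_gap: float = 1.0) -> List[Proposal]:
    """Merge adjacent proposals whose temporal gap is under max_gap seconds."""
    merged: List[Proposal] = []
    for p in sorted(proposals):
        if merged and p[0] - merged[-1][1] < max_gap:
            s, e, sc = merged[-1]
            merged[-1] = (s, max(e, p[1]), max(sc, p[2]))  # absorb into previous segment
        else:
            merged.append(p)
    return merged

# Toy run: similarity scores for 8 frames sampled at 1 fps.
sims = [0.2, 0.7, 0.8, 0.3, 0.6, 0.65, 0.1, 0.9]
print(post_process(generate_proposals(sims)))
# [(1.0, 3.0, 0.75), (4.0, 6.0, 0.625), (7.0, 8.0, 0.9)]
```

Thresholding plus gap merging is simply the most direct proposal scheme consistent with the abstract's description; the paper's actual generator may score, rank, and refine candidate segments differently.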
- Detecting Moments and Highlights in Videos via Natural Language Queries. Adv. Neural Inf. Process. Syst. 2021, 34, 11846–11858.
- Zero-shot Video Moment Retrieval with Off-the-Shelf Models. In Transfer Learning for Natural Language Processing Workshop; PMLR: New Orleans, LA, USA, 2023; pp. 10–21.
- Introducing ChatGPT. Available online: https://openai.com/blog/chatgpt (accessed on 1 December 2023).
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
- Baichuan 2: Open Large-scale Language Models. arXiv 2023, arXiv:2309.10305.
- MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478.
- Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485.
- LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 2798–2803.
- UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3042–3051.
- MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer. arXiv 2023, arXiv:2305.00355.
- Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding. Int. J. Semant. Web Inf. Syst. 2023, 19, 20. https://doi.org/10.4018/IJSWIS.332768.
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 23045–23055.
- UniVTG: Towards Unified Video-Language Temporal Grounding. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2794–2804.
- GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features. IEEE Signal Process. Lett. 2023, 31, 521–525. https://doi.org/10.1109/LSP.2023.3340103.
- Knowing Where to Focus: Event-aware Transformer for Video Grounding. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 13846–13856.
- Learning 2D temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12870–12877.
- Zero-shot natural language video localization. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1470–1479.
- Unsupervised temporal video grounding with deep semantic clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 1683–1691.
- Learning Video Moment Retrieval Without a Single Annotated Video. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1646–1657. https://doi.org/10.1109/TCSVT.2021.3075470.
- Prompt-based Zero-shot Video Moment Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 413–421. https://doi.org/10.1145/3503161.3548004.
- Language-free Training for Zero-shot Video Grounding. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2539–2548.
- Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Zero-shot video moment retrieval from frozen vision-language models. arXiv 2023, arXiv:2309.00661.
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv 2022, arXiv:2212.03191.
- Zero-Shot Video Moment Retrieval Using BLIP-Based Models. In Proceedings of the International Symposium on Visual Computing, Lake Tahoe, NV, USA, 16–18 October 2023; pp. 160–171.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
- Self-Chained Image-Language Model for Video Localization and Question Answering. arXiv 2023, arXiv:2305.06988.
- WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv 2023, arXiv:2304.12244.
- LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592.
- Human behavior inspired machine reading comprehension. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 425–434.
- VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084.
- A Survey on Evaluation of Large Language Models. arXiv 2023, arXiv:2307.03109.
- Tall: Temporal activity localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5267–5275.
- Dense-captioning events in videos. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 706–715.
- Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 510–526.
- Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 961–970.
- Weakly supervised video moment localization with contrastive negative sample mining. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 3517–3525.
- Weakly supervised temporal sentence grounding with gaussian-based contrastive proposal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
- Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 538–546. https://doi.org/10.1145/3581783.3612384.
- Weakly supervised temporal sentence grounding with uncertainty-guided self-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18908–18918.
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. arXiv 2023, arXiv:2306.05424.
- Pyramid Feature Attention Network for Monocular Depth Prediction. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
- Dual Attention Feature Fusion Network for Monocular Depth Estimation. In Proceedings of the CAAI International Conference on Artificial Intelligence, Hangzhou, China, 5–6 June 2021; pp. 456–468.
- Transient-steady state vibration characteristics and influencing factors under no-load closing conditions of converter transformers. Int. J. Electr. Power Energy Syst. 2024, 155, 109497.