VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT (2403.02076v1)
Abstract: Video temporal grounding (VTG) aims to locate specific temporal segments in an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce prejudice in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise a proposal generator and post-processing to produce accurate segments from the debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves performance competitive with supervised methods. The code is available at https://github.com/YoucanBaby/VTG-GPT
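The proposal-generation step described above can be sketched as follows. This is an illustrative, minimal reconstruction, not the authors' actual implementation: VTG-GPT scores frame captions against the debiased query with a sentence encoder (e.g. Sentence-BERT), whereas here a simple token-overlap similarity stands in for embedding cosine similarity, and consecutive high-scoring frames are merged into candidate segments. All function names and the threshold value are assumptions for the sketch.

```python
def similarity(query: str, caption: str) -> float:
    """Token-overlap (Jaccard) stand-in for an embedding-based similarity score."""
    q, c = set(query.lower().split()), set(caption.lower().split())
    return len(q & c) / max(len(q | c), 1)


def propose_segments(query, frame_captions, fps=1.0, threshold=0.3):
    """Group consecutive frames whose caption matches the query into (start, end) segments in seconds."""
    scores = [similarity(query, cap) for cap in frame_captions]
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                       # segment opens at first matching frame
        elif s < threshold and start is not None:
            segments.append((start / fps, i / fps))  # segment closes before this frame
            start = None
    if start is not None:                   # close a segment that runs to the end
        segments.append((start / fps, len(scores) / fps))
    return segments


# Toy example: one caption per second of video (captions as MiniGPT-v2 might produce)
captions = [
    "a man walks into the kitchen",
    "the man opens the fridge",
    "the man pours milk into a glass",
    "a dog sleeps on the couch",
]
print(propose_segments("man opens the fridge", captions))
```

In the paper's pipeline, post-processing (e.g. merging or rescoring overlapping proposals) would further refine these raw segments.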
- Detecting Moments and Highlights in Videos via Natural Language Queries. NeurIPS 2021, 34, 11846–11858.
- Zero-shot Video Moment Retrieval with Off-the-Shelf Models. In Transfer Learning for Natural Language Processing Workshop; PMLR: New Orleans, LA, USA, 2023; pp. 10–21.
- Introducing ChatGPT. Available online: https://openai.com/blog/chatgpt (accessed on 1 December 2023).
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
- Baichuan 2: Open Large-scale Language Models. arXiv 2023, arXiv:2309.10305.
- MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478.
- Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485.
- LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 2798–2803.
- UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3042–3051.
- MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer. arXiv 2023, arXiv:2305.00355.
- Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding. IJSWIS 2023, 19, 20. https://doi.org/10.4018/IJSWIS.332768.
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 23045–23055.
- UniVTG: Towards Unified Video-Language Temporal Grounding. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2794–2804.
- GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features. IEEE Signal Process. Lett. 2023, 31, 521–525. https://doi.org/10.1109/LSP.2023.3340103.
- Knowing Where to Focus: Event-aware Transformer for Video Grounding. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 13846–13856.
- Learning 2D temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12870–12877.
- Zero-shot natural language video localization. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1470–1479.
- Unsupervised temporal video grounding with deep semantic clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 1683–1691.
- Learning Video Moment Retrieval Without a Single Annotated Video. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1646–1657. https://doi.org/10.1109/TCSVT.2021.3075470.
- Prompt-based Zero-shot Video Moment Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 413–421. https://doi.org/10.1145/3503161.3548004.
- Language-free Training for Zero-shot Video Grounding. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2539–2548.
- Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Zero-shot video moment retrieval from frozen vision-language models. arXiv 2023, arXiv:2309.00661.
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv 2022, arXiv:2212.03191.
- Zero-Shot Video Moment Retrieval Using BLIP-Based Models. In Proceedings of the International Symposium on Visual Computing, Lake Tahoe, NV, USA, 16–18 October 2023; pp. 160–171.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
- Self-Chained Image-Language Model for Video Localization and Question Answering. arXiv 2023, arXiv:2305.06988.
- WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv 2023, arXiv:2304.12244.
- LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv 2023, arXiv:2304.10592.
- Human behavior inspired machine reading comprehension. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 425–434.
- VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084.
- A Survey on Evaluation of Large Language Models. arXiv 2023, arXiv:2307.03109.
- Tall: Temporal activity localization via language query. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5267–5275.
- Dense-captioning events in videos. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 706–715.
- Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 510–526.
- Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 961–970.
- Weakly supervised video moment localization with contrastive negative sample mining. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 3517–3525.
- Weakly supervised temporal sentence grounding with gaussian-based contrastive proposal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
- Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 538–546. https://doi.org/10.1145/3581783.3612384.
- Weakly supervised temporal sentence grounding with uncertainty-guided self-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18908–18918.
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. arXiv 2023, arXiv:2306.05424.
- Pyramid Feature Attention Network for Monocular Depth Prediction. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
- Dual Attention Feature Fusion Network for Monocular Depth Estimation. In Proceedings of the CAAI International Conference on Artificial Intelligence, Hangzhou, China, 5–6 June 2021; pp. 456–468.
- Transient-steady state vibration characteristics and influencing factors under no-load closing conditions of converter transformers. Int. J. Electr. Power Energy Syst. 2024, 155, 109497.