LITA: Language Instructed Temporal-Localization Assistant
Abstract: There has been tremendous progress in multimodal LLMs. Recent works have extended these models to video input with promising instruction-following capabilities. However, an important missing piece is temporal localization: these models cannot accurately answer "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing the Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with a dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and the temporal localization capabilities of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA
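The relative time-token encoding and the temporal mIoU metric mentioned in the abstract can be made concrete with a small sketch. The snippet below is a minimal illustration, assuming a fixed vocabulary of `num_tokens` discrete time tokens and uniform quantization of the video duration; the function names, the rounding scheme, and the IoU helper are our own illustrative choices, not taken from the released LITA code.

```python
# Minimal sketch (assumptions ours): map absolute timestamps to relative
# time tokens <1>..<num_tokens> by uniform quantization, and compute the
# temporal IoU used to score predicted [start, end] segments.

def timestamp_to_token(t: float, video_len: float, num_tokens: int = 100) -> int:
    """Map an absolute timestamp (seconds) to a 1-indexed time token <k>."""
    t = min(max(t, 0.0), video_len)                      # clamp to the video
    return round(t / video_len * (num_tokens - 1)) + 1   # <1> = start, <num_tokens> = end

def token_to_timestamp(k: int, video_len: float, num_tokens: int = 100) -> float:
    """Invert the mapping: recover an approximate timestamp from token <k>."""
    return (k - 1) / (num_tokens - 1) * video_len

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Intersection-over-union of two [start, end] segments (averaged over a dataset, this gives mIoU)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: a 120 s video; suppose the model answers with tokens <26> and <41>.
start = token_to_timestamp(26, 120.0)   # ~30.3 s
end = token_to_timestamp(41, 120.0)     # ~48.5 s
print(temporal_iou((start, end), (30.0, 50.0)))  # ~0.92
```

Because the tokens are relative to the video length, the same vocabulary covers clips of any duration; the trade-off is that temporal precision scales with the clip length divided by the number of tokens.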