A Simple LLM Framework for Long-Range Video Question-Answering

(2312.17235)
Published Dec 28, 2023 in cs.CV

Abstract

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4) leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose short and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost. On EgoSchema, which is best known as a very long-form video question-answering benchmark, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NExT-QA and IntentQA. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NExT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.

LLoVi framework uses LLMs and visual captioners for effective long-range video question-answering.

Overview

  • AI has become adept at understanding short video clips but still struggles with long videos; existing models for long-range video question-answering are complex and costly.

  • LLoVi, a new framework, combines a short-term visual captioner with an LLM such as GPT-3.5 or GPT-4 to simplify long-range video question-answering (LVQA).

  • Performance depends significantly on the visual captioner, the LLM, and the design of the LLM prompts. LLoVi excels in accuracy on the EgoSchema benchmark.

  • LLoVi is versatile: it performs well across various datasets and extends to grounded LVQA, where it identifies the video segments relevant to a given question.

  • LLoVi's simplicity and zero-shot operation make it a promising direction, and its publicly released code supports further research and development.

Introduction to Long-Range Video Question-Answering

The field of AI has made considerable advances in understanding short video clips, typically ranging from a few seconds to a minute. Comprehending longer video sequences, which can span several minutes or even hours, introduces harder challenges. To handle long-range video understanding, especially for question-answering, researchers have generally relied on models with specialized temporal reasoning machinery, such as long-range feature banks, memory queues, and space-time graphs, which are costly to build and train.

A Simplified Framework Using Language Models

A recent study introduces LLoVi, a language-based framework that simplifies long-range video question-answering (LVQA). LLoVi pairs a short-term visual captioner with a Large Language Model (LLM) such as GPT-3.5 or GPT-4, leveraging the LLM's capacity for long-range reasoning. Instead of relying on complex video-specific techniques, LLoVi operates in two stages: first, it densely samples short clips from the long input video and describes each clip textually with the visual captioner; then, an LLM aggregates these descriptions to reason over the whole video and answer questions about its content.
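To make the two-stage design concrete, here is a minimal Python sketch of such a pipeline. It is an illustration under stated assumptions, not the authors' implementation: `caption_fn` stands in for a clip captioner (e.g., LaViLa or BLIP-2), `llm_fn` for a GPT-3.5/GPT-4 call, and the prompt wording is hypothetical.

```python
# Minimal sketch of a two-stage LLoVi-style pipeline (not the authors' code).
# `caption_fn` and `llm_fn` are placeholders for a clip captioner and an LLM call.
from typing import Callable, List, Sequence

def answer_long_video(
    clips: Sequence,                      # short clips (0.5-8s) densely sampled from the video
    question: str,
    caption_fn: Callable[[object], str],  # clip -> short textual caption
    llm_fn: Callable[[str], str],         # prompt -> LLM response
) -> str:
    # Stage 1: short-term visual captioning of every sampled clip.
    captions: List[str] = [caption_fn(clip) for clip in clips]

    # Stage 2: the LLM aggregates all captions and reasons over the whole video.
    caption_text = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    prompt = (
        "You are given captions of short clips from one long video, in temporal order:\n"
        f"{caption_text}\n\n"
        f"Question: {question}\n"
        "Answer the question based on the whole video."
    )
    return llm_fn(prompt)
```

The key design choice is that all cross-clip (long-range) reasoning happens inside the LLM; the vision model only ever sees one short clip at a time.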

Crucial Factors and Methodology Insights

An extensive empirical study in the paper identifies the components most critical to LVQA performance. The choice of both the visual captioner and the LLM proved significant. The authors further found that a specialized prompt structure substantially improves results: the LLM is first asked to produce a consolidated summary of the noisy short-term captions, and then to answer the question based on that synthesized narrative. With this recipe, the framework reaches 50.3% accuracy on the EgoSchema benchmark, an 18.1% absolute improvement over the previous best-performing approach.
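A hedged sketch of this summarize-then-answer prompting strategy might look as follows; the exact prompts used in the paper are not reproduced here, and `llm_fn` remains a placeholder for the LLM call.

```python
# Sketch of a summarize-then-answer prompting strategy (illustrative wording only).
from typing import Callable

def summarize_then_answer(
    caption_text: str,     # concatenated short-term captions, in temporal order
    question: str,
    llm_fn: Callable[[str], str],
) -> str:
    # Step 1: ask the LLM to condense the noisy captions into a coherent summary.
    summary = llm_fn(
        "Below are captions of short clips from a long video, in temporal order. "
        "Summarize what happens in the video, keeping the key actions and objects:\n"
        f"{caption_text}"
    )
    # Step 2: answer the question from the summary rather than the raw captions.
    return llm_fn(
        f"Video summary: {summary}\n\n"
        f"Question: {question}\n"
        "Answer the question based on the summary."
    )
```

Summarizing first condenses the noisy short-term captions into a coherent narrative, which, per the paper, makes the subsequent question-answering step considerably easier.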

Generalization and Grounded Question-Answering

The streamlined framework proved robust across a variety of datasets, including NExT-QA and IntentQA, indicating its applicability to diverse LVQA scenarios. The researchers also extended it to grounded LVQA, where the model must localize the video segment relevant to a question in addition to answering it; this extension outperforms all prior methods on the NExT-GQA benchmark.
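One way to realize such a grounded extension, sketched below under assumptions, is to prompt the LLM twice: once to pick the range of caption indices (and hence the temporal window) relevant to the question, and once to answer from that window only. This is an illustrative reading of the idea, not the authors' exact implementation or the NExT-GQA evaluation protocol; `llm_fn` and the index-parsing logic are hypothetical.

```python
# Rough sketch of a grounded-LVQA extension: localize first, then answer.
# Not the authors' implementation; prompts and index parsing are simplified assumptions.
import re
from typing import Callable, List, Tuple

def grounded_answer(
    captions: List[str],   # one caption per densely sampled clip, in temporal order
    question: str,
    llm_fn: Callable[[str], str],
) -> Tuple[str, Tuple[int, int]]:
    indexed = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))

    # Step 1: localize the relevant segment as a range of caption indices.
    span_reply = llm_fn(
        f"Captions of a long video, in temporal order:\n{indexed}\n\n"
        f"Question: {question}\n"
        "Reply with the range of caption indices relevant to the question, e.g. 12-18."
    )
    match = re.search(r"(\d+)\s*-\s*(\d+)", span_reply)
    start, end = (int(match.group(1)), int(match.group(2))) if match else (0, len(captions) - 1)

    # Step 2: answer using only the grounded window.
    window = "\n".join(captions[start : end + 1])
    answer = llm_fn(
        f"Relevant part of the video:\n{window}\n\nQuestion: {question}\nAnswer:"
    )
    return answer, (start, end)
```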

Conclusion

The simplicity and zero-shot operation of LLoVi make it a promising direction for future work in video understanding. The paper provides further details, and the code is openly available at https://github.com/CeeZh/LLoVi, which benefits the research community. By avoiding complicated video-specific mechanisms, LLoVi lets LLMs apply their innate long-range reasoning to video, with noteworthy efficiency and effectiveness.
