A Simple LLM Framework for Long-Range Video Question-Answering

(2312.17235)
Published Dec 28, 2023 in cs.CV

Abstract

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4) leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose short and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost. On EgoSchema, which is best known as a very long-form video question-answering benchmark, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NExT-QA and IntentQA. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NExT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.

LLoVi framework uses LLMs and visual captioners for effective long-range video question-answering.

Overview

  • AI has become adept at understanding short video clips but still struggles with long videos; existing models for long-range video question-answering are complex and costly.

  • LLoVi, a new framework, combines a short-term visual captioner with an LLM such as GPT-3.5 or GPT-4 to simplify long-range video question-answering (LVQA).

  • Performance depends significantly on the visual captioner, the LLM, and the design of the LLM prompts. LLoVi excels in accuracy on the EgoSchema benchmark.

  • LLoVi is versatile: it performs well across various datasets and extends to grounded LVQA, where it identifies the video segments relevant to a given question.

  • LLoVi's simplicity and zero-shot operation make it a promising direction, and its publicly released code supports further research and development.

Introduction to Long-Range Video Question-Answering

The field of AI has made considerable advances in understanding short video clips, typically ranging from a few seconds to a minute. Comprehending longer video sequences, which can span several minutes or even hours, introduces harder challenges. To handle long-range video understanding, especially for question-answering, researchers have generally relied on models with specialized temporal reasoning machinery, such as long-range feature banks, memory queues, and space-time graphs, which are costly to build and train.

A Simplified Framework Using Language Models

A recent study introduces LLoVi, a language-based framework that simplifies long-range video question-answering (LVQA). LLoVi pairs a short-term visual captioner with a Large Language Model (LLM) such as GPT-3.5 or GPT-4, leveraging the LLM's capacity for long-range reasoning. Instead of relying on complex video-specific techniques, LLoVi operates in two stages: first, it densely samples short clips from the long input video and describes each clip textually with the visual captioner; then, an LLM aggregates these descriptions to reason over the whole video and answer questions about its content.
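To make the two-stage design concrete, here is a minimal Python sketch of such a pipeline. It is an illustration under stated assumptions, not the authors' implementation: `caption_fn` stands in for a clip captioner (e.g., LaViLa or BLIP-2), `llm_fn` for a GPT-3.5/GPT-4 call, and the prompt wording is hypothetical.

```python
# Minimal sketch of a two-stage LLoVi-style pipeline (not the authors' code).
# `caption_fn` and `llm_fn` are placeholders for a clip captioner and an LLM call.
from typing import Callable, List, Sequence

def answer_long_video(
    clips: Sequence,                      # short clips (0.5-8s) densely sampled from the video
    question: str,
    caption_fn: Callable[[object], str],  # clip -> short textual caption
    llm_fn: Callable[[str], str],         # prompt -> LLM response
) -> str:
    # Stage 1: short-term visual captioning of every sampled clip.
    captions: List[str] = [caption_fn(clip) for clip in clips]

    # Stage 2: the LLM aggregates all captions and reasons over the whole video.
    caption_text = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    prompt = (
        "You are given captions of short clips from one long video, in temporal order:\n"
        f"{caption_text}\n\n"
        f"Question: {question}\n"
        "Answer the question based on the whole video."
    )
    return llm_fn(prompt)
```

The key design choice is that all cross-clip (long-range) reasoning happens inside the LLM; the vision model only ever sees one short clip at a time.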

Crucial Factors and Methodology Insights

An extensive empirical study in the paper identifies the components most critical to LVQA performance. The choice of both the visual captioner and the LLM proved significant. The authors further found that a specialized prompt structure substantially improves results: the LLM is first asked to produce a consolidated summary of the noisy short-term captions, and then to answer the question based on that synthesized narrative. With this recipe, the framework reaches 50.3% accuracy on the EgoSchema benchmark, an 18.1% absolute improvement over the previous best-performing approach.
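A hedged sketch of this summarize-then-answer prompting strategy might look as follows; the exact prompts used in the paper are not reproduced here, and `llm_fn` remains a placeholder for the LLM call.

```python
# Sketch of a summarize-then-answer prompting strategy (illustrative wording only).
from typing import Callable

def summarize_then_answer(
    caption_text: str,     # concatenated short-term captions, in temporal order
    question: str,
    llm_fn: Callable[[str], str],
) -> str:
    # Step 1: ask the LLM to condense the noisy captions into a coherent summary.
    summary = llm_fn(
        "Below are captions of short clips from a long video, in temporal order. "
        "Summarize what happens in the video, keeping the key actions and objects:\n"
        f"{caption_text}"
    )
    # Step 2: answer the question from the summary rather than the raw captions.
    return llm_fn(
        f"Video summary: {summary}\n\n"
        f"Question: {question}\n"
        "Answer the question based on the summary."
    )
```

Summarizing first condenses the noisy short-term captions into a coherent narrative, which, per the paper, makes the subsequent question-answering step considerably easier.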

Generalization and Grounded Question-Answering

The streamlined framework proved robust across a variety of datasets, including NExT-QA and IntentQA, indicating its applicability to diverse LVQA scenarios. The researchers also extended it to grounded LVQA, where the model must localize the video segment relevant to a question in addition to answering it; this extension outperforms all prior methods on the NExT-GQA benchmark.
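One way to realize such a grounded extension, sketched below under assumptions, is to prompt the LLM twice: once to pick the range of caption indices (and hence the temporal window) relevant to the question, and once to answer from that window only. This is an illustrative reading of the idea, not the authors' exact implementation or the NExT-GQA evaluation protocol; `llm_fn` and the index-parsing logic are hypothetical.

```python
# Rough sketch of a grounded-LVQA extension: localize first, then answer.
# Not the authors' implementation; prompts and index parsing are simplified assumptions.
import re
from typing import Callable, List, Tuple

def grounded_answer(
    captions: List[str],   # one caption per densely sampled clip, in temporal order
    question: str,
    llm_fn: Callable[[str], str],
) -> Tuple[str, Tuple[int, int]]:
    indexed = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))

    # Step 1: localize the relevant segment as a range of caption indices.
    span_reply = llm_fn(
        f"Captions of a long video, in temporal order:\n{indexed}\n\n"
        f"Question: {question}\n"
        "Reply with the range of caption indices relevant to the question, e.g. 12-18."
    )
    match = re.search(r"(\d+)\s*-\s*(\d+)", span_reply)
    start, end = (int(match.group(1)), int(match.group(2))) if match else (0, len(captions) - 1)

    # Step 2: answer using only the grounded window.
    window = "\n".join(captions[start : end + 1])
    answer = llm_fn(
        f"Relevant part of the video:\n{window}\n\nQuestion: {question}\nAnswer:"
    )
    return answer, (start, end)
```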

Conclusion

The simplicity and zero-shot operation of LLoVi make it a promising direction for future work in video understanding. The paper provides further details, and the code is openly available at https://github.com/CeeZh/LLoVi, which benefits the research community. By avoiding complicated video-specific mechanisms, LLoVi lets LLMs apply their innate long-range reasoning to video, with noteworthy efficiency and effectiveness.
