Abstract

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While current video LMMs utilize advanced LLMs, they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and zero-shot question-answering. Further, we develop a 112K video-instruction set using a novel semi-automatic annotation pipeline, which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark with 4,354 question-answer pairs evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.

VideoGPT+ integrates image and video encoders for comprehensive video understanding and efficient temporal detail retention.

Overview

  • VideoGPT+ integrates image and video encoders to provide enhanced video understanding by capturing both spatial details and temporal context.

  • The model employs a dual encoder design and introduces novel benchmarks and datasets, demonstrating strong performance across various video understanding tasks.

  • Key advancements include the introduction of segment-wise sampling, dual vision encoders, visual adapter modules, and efficient LLM fine-tuning, leading to superior results in video comprehension.

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

The paper presents VideoGPT+, a model that combines the complementary benefits of image and video encoders for enhanced video understanding. The approach addresses the limitations of current Large Multimodal Models (LMMs) that utilize either image or video encoders. Image encoders capture rich spatial details but lack temporal context, while video encoders provide temporal context but often at the expense of spatial resolution and computational efficiency. VideoGPT+ overcomes these constraints by integrating both encoder types, thereby enabling robust spatiotemporal understanding.

Methodology

VideoGPT+ leverages a dual encoder design incorporating a high-resolution image encoder and a temporal-context-aware video encoder. The key components of VideoGPT+ include:

  1. Segment-wise Sampling: The model divides videos into smaller segments and applies a segment-wise sampling strategy to ensure comprehensive temporal context capture. This method contrasts with uniform sampling, which can miss significant temporal dynamics.
  2. Dual Vision Encoder: The architecture pairs a CLIP ViT-L/14 image encoder for detailed spatial information with an InternVideo-v2 video encoder for temporal context. This dual strategy yields a rich representation of both spatial and temporal features.
  3. Visual Adapter Module: Features extracted from image and video encoders are projected into a common space using visual adapters. This involves projecting image and video features into the language space through specific projection layers, followed by adaptive pooling to manage computational complexity effectively.
  4. Large Language Model (LLM): The integrated features are then fed into the LLM, which processes them together with the user query to generate comprehensive video-based responses. The LLM is fine-tuned with LoRA for efficient training (a schematic sketch of this pipeline follows the list).
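
The components above can be summarized in a short PyTorch-style sketch. The encoder classes below are lightweight stand-ins for CLIP ViT-L/14 and InternVideo-v2 (they emit random features in place of pretrained weights), and the dimensions, segment counts, and pooling sizes are illustrative assumptions rather than the paper's exact configuration; how the pooled image and video tokens are ordered before reaching the LLM is also simplified.

```python
# Minimal sketch of a VideoGPT+-style dual-encoder pipeline (illustrative only).
import torch
import torch.nn as nn


class StubImageEncoder(nn.Module):
    """Stand-in for a CLIP-style image encoder: per-frame patch features."""
    def __init__(self, dim=1024, num_patches=256):
        super().__init__()
        self.dim, self.num_patches = dim, num_patches

    def forward(self, frames):                      # frames: (B*T, 3, H, W)
        return torch.randn(frames.shape[0], self.num_patches, self.dim)


class StubVideoEncoder(nn.Module):
    """Stand-in for an InternVideo-style encoder: per-segment spatio-temporal tokens."""
    def __init__(self, dim=768, num_tokens=128):
        super().__init__()
        self.dim, self.num_tokens = dim, num_tokens

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        return torch.randn(clip.shape[0], self.num_tokens, self.dim)


class VisualAdapter(nn.Module):
    """Project visual features into the LLM space, then pool to cut token count."""
    def __init__(self, in_dim, llm_dim=4096, pooled_tokens=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))
        self.pool = nn.AdaptiveAvgPool1d(pooled_tokens)

    def forward(self, feats):                       # feats: (N, tokens, in_dim)
        x = self.proj(feats)                        # (N, tokens, llm_dim)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)  # (N, pooled, llm_dim)


def segment_wise_sample(video, num_segments=4, frames_per_segment=4):
    """Split a video (T, 3, H, W) into segments and sample frames from each one."""
    T = video.shape[0]
    seg_len = T // num_segments
    segments = []
    for s in range(num_segments):
        seg = video[s * seg_len:(s + 1) * seg_len]
        idx = torch.linspace(0, seg.shape[0] - 1, frames_per_segment).long()
        segments.append(seg[idx])
    return torch.stack(segments)                    # (num_segments, frames, 3, H, W)


if __name__ == "__main__":
    video = torch.randn(64, 3, 224, 224)            # dummy 64-frame video
    segments = segment_wise_sample(video)           # (4, 4, 3, 224, 224)

    img_enc, vid_enc = StubImageEncoder(), StubVideoEncoder()
    img_adapter = VisualAdapter(in_dim=1024)
    vid_adapter = VisualAdapter(in_dim=768)

    img_feats = img_enc(segments.flatten(0, 1))     # spatial detail per frame
    vid_feats = vid_enc(segments)                   # temporal context per segment

    visual_tokens = torch.cat([img_adapter(img_feats).flatten(0, 1),
                               vid_adapter(vid_feats).flatten(0, 1)], dim=0)
    print(visual_tokens.shape)                      # tokens passed on to the LoRA-tuned LLM
```

Adaptive pooling keeps the visual token count manageable regardless of how many segments are sampled, which is what makes feeding features from two encoders into the LLM tractable.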

Results and Evaluation

VideoGPT+ demonstrates strong performance across several benchmarks, indicating its efficacy in video understanding tasks:

  • VCGBench: VideoGPT+ achieved an average score of 3.28, outperforming previous state-of-the-art models across all evaluation metrics, including Correctness of Information (CI), Detail Orientation (DO), Contextual Understanding (CU), Temporal Understanding (TU), and Consistency (CO).
  • VCGBench-Diverse: Introduced in this work, this benchmark covers 18 broad video categories and extends the evaluation to varying video capturing methods and reasoning complexities. VideoGPT+ achieved an average score of 2.47, showing significant improvements in spatial and temporal understanding.
  • MVBench: On MVBench, VideoGPT+ excels across a wide range of tasks, including action prediction and object interaction, reflecting its strong temporal understanding.
  • Zero-shot Question-Answering: The model shows strong generalization across diverse zero-shot QA datasets, achieving the highest results in both accuracy and answer-quality score.

Dataset and Benchmark Contributions

The paper also introduces VCG+, a 112K video-instruction set generated through a semi-automatic annotation pipeline that improves the quality of the training data. Additionally, VCGBench-Diverse offers a diverse and robust benchmark for comprehensively evaluating video LMMs across multiple video categories, capturing different filming techniques and reasoning complexities.
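
As a rough illustration of what one semi-automatic annotation step might look like, the sketch below prompts an LLM to turn a dense video description into instruction-style QA pairs and applies a trivial automatic filter before human review. The query_llm helper, the prompt wording, and the filtering rule are hypothetical placeholders for illustration and are not taken from the paper.

```python
# Illustrative sketch of a semi-automatic QA-generation step (assumptions only).
import json


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to an instruction-tuned LLM."""
    raise NotImplementedError("Plug in your preferred LLM client here.")


def generate_qa_pairs(video_id: str, dense_caption: str, num_pairs: int = 3):
    prompt = (
        "You are annotating a video description. Write "
        f"{num_pairs} question-answer pairs that require understanding the "
        "events, objects, and their temporal order. Return a JSON list of "
        "objects with 'question' and 'answer' fields.\n\n"
        f"Description:\n{dense_caption}"
    )
    pairs = json.loads(query_llm(prompt))
    # Simple automatic filter: drop trivially short answers before a human
    # spot-check of the remaining pairs (the "semi-automatic" part).
    pairs = [p for p in pairs if len(p["answer"].split()) >= 5]
    return [{"video_id": video_id, **p} for p in pairs]
```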

Implications and Future Directions

The integration of both image and video encoders in VideoGPT+ offers significant improvements in video understanding, particularly in capturing fine-grained spatial details and temporal dynamics. The strong numerical results across multiple benchmarks validate the model's efficacy.

The dual encoder design and enhanced annotation techniques pave the way for future research in video understanding, particularly in:

  • Action Localization and Prediction: Future models could focus on improving the precision of action boundaries within videos.
  • Long Video Navigation: Handling very long videos remains challenging; more efficient segment-wise or hierarchical approaches are a promising direction.
  • Path Following and Reasoning: As video understanding models evolve, enhancing their capability to follow long, complex paths and reason about events within these contexts will be critical for advancing practical applications.

In summary, VideoGPT+ sets a new standard in video understanding by effectively combining the strengths of image and video encoders. The introduction of a diverse benchmark and an enriched dataset further solidifies its contribution to advancing the field of large multimodal models.
