
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

(arXiv:2407.07895)
Published Jul 10, 2024 in cs.CV, cs.CL, and cs.LG

Abstract

Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks, while maintaining performance on single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT

Overview

  • LLaVA-NeXT-Interleave is an advanced framework for large multimodal models (LMMs), integrating multi-image, video, and 3D data into a versatile system.

  • Utilizing the M4-Instruct dataset with 1,177.6k samples spanning 14 tasks from 41 datasets, the model achieves state-of-the-art performance across diverse visual tasks.

  • Innovations include initialization from single-image models, mixed interleaved data formats, and combined data scenarios, leading to robust performance and emerging capabilities not explicitly trained for.

Overview of LLaVA-NeXT-Interleave: Addressing Complex Multimodal Tasks in LMMs

The academic paper titled "LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models" presents a comprehensive approach for advancing large multimodal models (LMMs) by integrating diverse visual data types into a unified framework. This work represents a significant step forward in addressing the limitations of existing LMMs that primarily focus on single-image tasks without a generalized approach to multi-image scenarios.

Framework and Data Integration

LLaVA-NeXT-Interleave introduces an innovative framework designed to handle diverse visual data representations. The model extends its capabilities to encompass multi-image, multi-frame (video), and multi-view (3D) scenarios, while maintaining performance on traditional single-image tasks. This is accomplished by adopting an interleaved data format that serves as a universal template for visual instruction tuning.

The authors compiled the M4-Instruct dataset, a large-scale, high-quality dataset of 1,177.6k samples spanning four primary domains: multi-image, video, 3D, and single-image. It covers 14 tasks drawn from 41 datasets, providing a rich and diverse training ground for the LMM. Additionally, the study introduces the LLaVA-Interleave Bench, encompassing both in-domain and out-domain benchmarks to rigorously evaluate multi-image performance.
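To make the interleaved template concrete, the sketch below shows what a single training sample might look like, assuming a LLaVA-style conversation JSON in which `<image>` placeholders are woven directly into the dialogue text. The field names and values are illustrative; the exact M4-Instruct schema may differ.

```python
# Illustrative sketch of an interleaved training sample, assuming a
# LLaVA-style conversation format. Field names and the "<image>" placeholder
# follow common LLaVA conventions; the real M4-Instruct schema may differ.
sample = {
    "id": "m4_instruct_000001",                  # hypothetical sample id
    "images": ["view_01.jpg", "view_02.jpg"],    # multi-image, video frames, and 3D views all share this list form
    "conversations": [
        {
            "from": "human",
            # Image placeholders are interleaved directly in the text, so the
            # same template covers single-image, multi-image, multi-frame,
            # and multi-view inputs.
            "value": "Here is the first view: <image>\nand the second view: <image>\nWhat changed between them?",
        },
        {
            "from": "gpt",
            "value": "The object on the table has been rotated roughly 90 degrees.",
        },
    ],
}
```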

Experimental Evaluation

The paper reports that LLaVA-NeXT-Interleave achieves state-of-the-art (SoTA) performance across various multi-image, video, and 3D benchmarks. Specifically:

  1. Multi-Image Performance: The model significantly surpasses previous models on the LLaVA-Interleave Bench in both in-domain and out-domain evaluations. For instance, it demonstrates average scores of 62.3 and 44.3 for in-domain and out-domain tasks, respectively, with the 14B-parameter model configuration.
  2. Video Performance: The model excels in video comprehension tasks, achieving SoTA results on multiple benchmarks such as NExT-QA and ActivityNet-QA. The introduction of diverse training data enables the model to handle video captioning and video QA with notable accuracy, further highlighting its robustness for temporal data understanding.
  3. 3D Performance: The model demonstrates exceptional capability in interpreting 3D environments, employing multi-view images for spatial understanding. It outperforms existing methods, including those leveraging additional modalities like point clouds, achieving an average score of 59.2 on in-domain benchmarks.
  4. Single-Image Performance: Despite its extensive multi-image training, the model maintains competitive performance on traditional single-image benchmarks, affirming its versatility. The integration of AnyRes (any-resolution) training enhances its adaptability to multi-patch scenarios (a patch-splitting sketch follows this list).
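As a rough illustration of the multi-patch idea, the sketch below tiles a high-resolution image into a grid of equally sized patches that can then be fed to the model like a multi-image input. This mirrors the spirit of AnyRes-style training rather than reproducing the authors' implementation; the grid size and resizing policy are assumptions.

```python
# Hypothetical sketch: tile an image into patches so a single image can be
# processed as a "multi-patch" sequence. Not the authors' implementation.
from PIL import Image

def split_into_patches(img: Image.Image, grid=(2, 2), patch_size=336):
    """Resize the image to fit a grid of equal tiles and return the tiles."""
    cols, rows = grid
    img = img.resize((patch_size * cols, patch_size * rows))
    patches = []
    for r in range(rows):
        for c in range(cols):
            box = (c * patch_size, r * patch_size,
                   (c + 1) * patch_size, (r + 1) * patch_size)
            patches.append(img.crop(box))
    return patches

# Usage: the resulting tiles are treated just like a multi-image input,
# e.g. patches = split_into_patches(Image.open("photo.jpg"))
```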

Methodological Innovations

The paper identifies three core methodological innovations contributing to the model's performance:

  1. Initialization from Single-Image Models: The approach leverages a pre-trained LLaVA-NeXT-Image model, significantly enhancing the model's ability to handle complex multi-image tasks.
  2. Mixed Interleaved Data Formats: The use of mixed training formats (in-the-front and interleaved) for positioning image tokens enables the model to handle varied inference modes flexibly (see the sketch after this list).
  3. Combining Diverse Data Scenarios: Integrating different data scenarios (multi-image, video, 3D, and single-image) during training enhances the model's overall capability and enables the emergence of new, cross-domain abilities.
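The sketch below contrasts the two image-token layouts referenced above: placing all image placeholders in front of the text versus interleaving them with the text. The `<image>` placeholder follows common LLaVA conventions; the helper function is a hypothetical illustration, not part of the released codebase.

```python
# Hypothetical helper contrasting the two prompt layouts: "interleaved"
# keeps <image> placeholders inside the text, "in-the-front" moves them
# all to the beginning of the prompt.
def to_in_front_format(prompt: str) -> str:
    """Move every interleaved <image> placeholder to the front of the prompt."""
    num_images = prompt.count("<image>")
    text = " ".join(prompt.replace("<image>", "").split())  # drop placeholders, tidy whitespace
    return "<image>\n" * num_images + text

interleaved = "Here are two frames: <image> <image> What happens between them?"
print(to_in_front_format(interleaved))
# -> <image>
#    <image>
#    Here are two frames: What happens between them?
```

Training on both layouts means the model accepts either style at inference time, which is what makes the single checkpoint usable across multi-image, video, and 3D prompts without format-specific retraining.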

Emerging Capabilities and Future Implications

One of the notable aspects of LLaVA-NeXT-Interleave is its emerging capabilities. The model demonstrates new abilities during inference that were not explicitly trained for, such as transferring task knowledge from single-image to multi-image contexts and handling real-world applications with high accuracy. This robustness suggests potential for broader applications in AI, such as enhanced multimedia understanding and autonomous systems capable of real-time decision-making across varied data types.

Conclusion

LLaVA-NeXT-Interleave’s ability to unify and excel across multi-image, video, 3D, and single-image tasks marks a significant development in the field of multimodal AI. The model's innovative use of interleaved data formats and comprehensive training dataset positions it as a versatile tool capable of addressing complex real-world challenges. Future research may further explore the boundaries of this unified approach, potentially integrating additional modalities and optimizing for even greater flexibility and performance.
