
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

(arXiv:2407.07895)
Published Jul 10, 2024 in cs.CV, cs.CL, and cs.LG

Abstract

Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks, while maintaining performance on single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT

Overview

  • LLaVA-NeXT-Interleave is an advanced framework for large multimodal models (LMMs), integrating multi-image, video, and 3D data into a versatile system.

  • Utilizing the M4-Instruct dataset with 1,177.6k samples spanning 14 tasks from 41 datasets, the model achieves state-of-the-art performance across diverse visual tasks.

  • Innovations include initialization from single-image models, mixed interleaved data formats, and combined data scenarios, leading to robust performance and emerging capabilities not explicitly trained for.

Overview of LLaVA-NeXT-Interleave: Addressing Complex Multimodal Tasks in LMMs

The academic paper titled "LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models" presents a comprehensive approach for advancing large multimodal models (LMMs) by integrating diverse visual data types into a unified framework. This work represents a significant step forward in addressing the limitations of existing LMMs that primarily focus on single-image tasks without a generalized approach to multi-image scenarios.

Framework and Data Integration

LLaVA-NeXT-Interleave introduces an innovative framework designed to handle diverse visual data representations. The model extends its capabilities to encompass multi-image, multi-frame (video), and multi-view (3D) scenarios, while maintaining performance on traditional single-image tasks. This is accomplished by adopting an interleaved data format that serves as a universal template for visual instruction tuning.

The authors compiled the M4-Instruct dataset, a large-scale, high-quality dataset of 1,177.6k samples spanning four primary domains: multi-image, video, 3D, and single-image. It covers 14 tasks drawn from 41 datasets, providing a rich and diverse training ground for the LMM. Additionally, the study introduces the LLaVA-Interleave Bench, encompassing both in-domain and out-domain benchmarks to rigorously evaluate multi-image performance.
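To make the interleaved template concrete, the sketch below shows what a single training sample might look like, assuming a LLaVA-style conversation JSON in which `<image>` placeholders are woven directly into the dialogue text. The field names and values are illustrative; the exact M4-Instruct schema may differ.

```python
# Illustrative sketch of an interleaved training sample, assuming a
# LLaVA-style conversation format. Field names and the "<image>" placeholder
# follow common LLaVA conventions; the real M4-Instruct schema may differ.
sample = {
    "id": "m4_instruct_000001",                  # hypothetical sample id
    "images": ["view_01.jpg", "view_02.jpg"],    # multi-image, video frames, and 3D views all share this list form
    "conversations": [
        {
            "from": "human",
            # Image placeholders are interleaved directly in the text, so the
            # same template covers single-image, multi-image, multi-frame,
            # and multi-view inputs.
            "value": "Here is the first view: <image>\nand the second view: <image>\nWhat changed between them?",
        },
        {
            "from": "gpt",
            "value": "The object on the table has been rotated roughly 90 degrees.",
        },
    ],
}
```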

Experimental Evaluation

The paper reports that LLaVA-NeXT-Interleave achieves state-of-the-art (SoTA) performance across various multi-image, video, and 3D benchmarks. Specifically:

  1. Multi-Image Performance: The model significantly surpasses previous models on the LLaVA-Interleave Bench in both in-domain and out-domain evaluations. For instance, it demonstrates average scores of 62.3 and 44.3 for in-domain and out-domain tasks, respectively, with the 14B-parameter model configuration.
  2. Video Performance: The model excels in video comprehension tasks, achieving SoTA results on multiple benchmarks such as NExT-QA and ActivityNet-QA. The introduction of diverse training data enables the model to handle video captioning and video QA with notable accuracy, further highlighting its robustness for temporal data understanding.
  3. 3D Performance: The model demonstrates exceptional capability in interpreting 3D environments, employing multi-view images for spatial understanding. It outperforms existing methods, including those leveraging additional modalities like point clouds, achieving an average score of 59.2 on in-domain benchmarks.
  4. Single-Image Performance: Despite its extensive multi-image training, the model maintains competitive performance on traditional single-image benchmarks, affirming its versatility. The integration of AnyRes (any-resolution) training enhances its adaptability to multi-patch scenarios (a patch-splitting sketch follows this list).
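As a rough illustration of the multi-patch idea, the sketch below tiles a high-resolution image into a grid of equally sized patches that can then be fed to the model like a multi-image input. This mirrors the spirit of AnyRes-style training rather than reproducing the authors' implementation; the grid size and resizing policy are assumptions.

```python
# Hypothetical sketch: tile an image into patches so a single image can be
# processed as a "multi-patch" sequence. Not the authors' implementation.
from PIL import Image

def split_into_patches(img: Image.Image, grid=(2, 2), patch_size=336):
    """Resize the image to fit a grid of equal tiles and return the tiles."""
    cols, rows = grid
    img = img.resize((patch_size * cols, patch_size * rows))
    patches = []
    for r in range(rows):
        for c in range(cols):
            box = (c * patch_size, r * patch_size,
                   (c + 1) * patch_size, (r + 1) * patch_size)
            patches.append(img.crop(box))
    return patches

# Usage: the resulting tiles are treated just like a multi-image input,
# e.g. patches = split_into_patches(Image.open("photo.jpg"))
```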

Methodological Innovations

The paper identifies three core methodological innovations contributing to the model's performance:

  1. Initialization from Single-Image Models: The approach leverages a pre-trained LLaVA-NeXT-Image model, significantly enhancing the model's ability to handle complex multi-image tasks.
  2. Mixed Interleaved Data Formats: The use of mixed training formats (in-the-front and interleaved) for positioning image tokens enables the model to handle varied inference modes flexibly (see the sketch after this list).
  3. Combining Diverse Data Scenarios: Integrating different data scenarios (multi-image, video, 3D, and single-image) during training enhances the model's overall capability and enables the emergence of new, cross-domain abilities.
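The sketch below contrasts the two image-token layouts referenced above: placing all image placeholders in front of the text versus interleaving them with the text. The `<image>` placeholder follows common LLaVA conventions; the helper function is a hypothetical illustration, not part of the released codebase.

```python
# Hypothetical helper contrasting the two prompt layouts: "interleaved"
# keeps <image> placeholders inside the text, "in-the-front" moves them
# all to the beginning of the prompt.
def to_in_front_format(prompt: str) -> str:
    """Move every interleaved <image> placeholder to the front of the prompt."""
    num_images = prompt.count("<image>")
    text = " ".join(prompt.replace("<image>", "").split())  # drop placeholders, tidy whitespace
    return "<image>\n" * num_images + text

interleaved = "Here are two frames: <image> <image> What happens between them?"
print(to_in_front_format(interleaved))
# -> <image>
#    <image>
#    Here are two frames: What happens between them?
```

Training on both layouts means the model accepts either style at inference time, which is what makes the single checkpoint usable across multi-image, video, and 3D prompts without format-specific retraining.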

Emerging Capabilities and Future Implications

One of the notable aspects of LLaVA-NeXT-Interleave is its emerging capabilities. The model demonstrates new abilities during inference that were not explicitly trained for, such as transferring task knowledge from single-image to multi-image contexts and handling real-world applications with high accuracy. This robustness suggests potential for broader applications in AI, such as enhanced multimedia understanding and autonomous systems capable of real-time decision-making across varied data types.

Conclusion

LLaVA-NeXT-Interleave’s ability to unify and excel across multi-image, video, 3D, and single-image tasks marks a significant development in the field of multimodal AI. The model's innovative use of interleaved data formats and comprehensive training dataset positions it as a versatile tool capable of addressing complex real-world challenges. Future research may further explore the boundaries of this unified approach, potentially integrating additional modalities and optimizing for even greater flexibility and performance.
