Emergent Mind

MANTIS: Interleaved Multi-Image Instruction Tuning

(2405.01483)
Published May 2, 2024 in cs.CV , cs.AI , and cs.CL

Abstract

Recent years have witnessed a great array of large multimodal models (LMMs) that effectively solve single-image vision-language tasks. However, their ability to solve multi-image vision-language tasks remains limited. Existing multi-image LMMs (e.g., OpenFlamingo, Emu, Idefics) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text pairs from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, containing 721K instances from 14 multi-image datasets. We design Mantis-Instruct to cover different multi-image skills such as co-reference, reasoning, comparison, and temporal understanding. We combine Mantis-Instruct with several single-image vision-language datasets to train our model, Mantis, to handle any interleaved image-text input. We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks. Though requiring only academic-level resources (i.e., 36 hours on 16xA100-40G), Mantis-8B achieves state-of-the-art performance on all the multi-image benchmarks and beats the existing best multi-image LMM, Idefics2-8B, by an average of 9 absolute points. We observe that Mantis performs equally well on held-in and held-out evaluation benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that it maintains strong single-image performance on par with CogVLM and Emu2. Our results are particularly encouraging, as they show that low-cost instruction tuning is indeed much more effective than intensive pre-training for building multi-image LMMs.

Figure: the Mantis model's capabilities in understanding multiple images to complete specified tasks.

Overview

  • The paper introduces Mantis, an AI model tailored for enhanced multi-image understanding, tackling gaps in existing large multimodal models (LMMs).

  • Mantis employs instruction tuning on the Mantis-Instruct dataset, needing fewer resources for training and achieving superior performance across various benchmarks.

  • The model's innovative use of interleaved text-image input handling, efficient training, and its implications for future AI research and applications are discussed.

Examining Mantis: Enhancing AI's Multi-Image Understanding

Introduction to Mantis: A Leap in Large Multimodal Models

While the AI research community has made significant strides in developing models that effectively handle single-image inputs, multi-image tasks have remained relatively underserved. This discrepancy becomes apparent in real-world applications where understanding sequences or sets of images is crucial. The paper introduces a model named Mantis, designed specifically to address this gap. Let's examine its approach and how it compares with existing models.

What Sets Mantis Apart

Mantis targets direct improvements in handling multi-image scenarios by employing a method known as instruction tuning on a specially curated dataset dubbed Mantis-Instruct. This dataset boasts 721K instances covering a variety of multi-image tasks designed to bolster the model's capacity in co-reference, reasoning, comparison, and temporal understanding of visual data.

Here's what makes Mantis noteworthy:

  • Efficient Training: Unlike its predecessors, which relied on pre-training on vast amounts of data, Mantis achieves superior results using a fraction of the resources—just 36 hours on 16xA100-40G GPUs.
  • Strong Performance Metrics: Mantis not only outperforms existing multi-image LMMs on various benchmarks but does so by a noticeable margin, achieving state-of-the-art results and even rivaling models like GPT-4V in specific tasks.
  • Robust Generalization: Its performance is consistent across both 'held-in' and 'held-out' evaluation settings, evidencing strong generalization abilities.
  • Low Resource, High Yield: By demonstrating that low-cost instruction tuning is more effective than intensive pre-training, Mantis offers a more accessible model-building methodology that could democratize advanced AI research.

Under the Hood: How Mantis Achieves Its Edge

Mantis combines instruction tuning with a pre-trained language model and a visual transformer encoder, leveraging both textual and visual data. The underlying architecture ensures that the model can handle interleaved text-image inputs effectively, preparing it for complex real-world applications where such capabilities are indispensable.
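To make the idea of interleaved text-image input concrete, here is a minimal sketch (not the authors' code) of how a prompt containing image placeholders can be flattened into one model-ready sequence. The `<image>` placeholder convention, the stand-in tokenizer, and the stand-in vision encoder are all illustrative assumptions; a real system would use a learned tokenizer and a visual transformer encoder producing patch embeddings.

```python
IMAGE_TOKEN = "<image>"

def encode_text(text):
    """Stand-in tokenizer: one integer id per whitespace-separated token."""
    return [hash(tok) % 50000 for tok in text.split()]

def encode_image(image):
    """Stand-in vision encoder: returns a fixed number of 'patch' embeddings.
    A real encoder would return dense vectors; here we use tagged tuples."""
    return [("img_patch", image, i) for i in range(4)]  # 4 patches per image

def build_interleaved_sequence(prompt, images):
    """Replace each <image> placeholder in the prompt with the patch
    embeddings of the corresponding image, preserving left-to-right order."""
    segments = prompt.split(IMAGE_TOKEN)
    if len(segments) - 1 != len(images):
        raise ValueError("number of <image> placeholders must match images")
    sequence = []
    for i, segment in enumerate(segments):
        sequence.extend(("text", tid) for tid in encode_text(segment))
        if i < len(images):
            sequence.extend(("image", emb) for emb in encode_image(images[i]))
    return sequence

seq = build_interleaved_sequence(
    f"Compare {IMAGE_TOKEN} and {IMAGE_TOKEN} . Which is brighter?",
    ["photo_a.jpg", "photo_b.jpg"],
)
```

The key design point this sketch illustrates is that image embeddings are spliced into the token stream at their original positions, so the language model sees text and images in the order the user provided them rather than as a single image prepended to the text.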

Here are the key technical pillars that support Mantis:

  1. Diverse Data Handling: By training on varied datasets, each representing different skills, Mantis is not just learning to recognize images but to understand the context, differences, and temporal dynamics within them.
  2. Innovative Training Routine: Instead of the traditional massive pre-training routine, Mantis uses targeted instruction tuning, which makes it resource-efficient and quick to adapt to new types of data.
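To illustrate what a skill-targeted instruction-tuning instance might look like, here is a hypothetical record in the spirit of Mantis-Instruct. The field names and schema below are assumptions for illustration, not the dataset's actual format; the only structural property enforced is that image placeholders in the conversation line up with the attached images.

```python
import json

# Hypothetical multi-image instruction-tuning record (illustrative schema).
record = {
    "id": "coref-000001",
    "skill": "co-reference",  # other skills: reasoning, comparison, temporal
    "images": ["frame_01.png", "frame_02.png"],
    "conversation": [
        {"role": "user",
         "content": "In <image> and <image>, is the same person shown?"},
        {"role": "assistant",
         "content": "Yes, the person in the first image reappears in the second."},
    ],
}

def validate(rec):
    """Check that the number of <image> placeholders across the conversation
    matches the number of attached images."""
    n_placeholders = sum(turn["content"].count("<image>")
                         for turn in rec["conversation"])
    return n_placeholders == len(rec["images"])

assert validate(record)
print(json.dumps(record, indent=2))
```

Grouping instances by skill in this way is what lets a comparatively small, curated dataset teach distinct multi-image abilities, rather than hoping they emerge from web-scale noisy interleaved data.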

Future Implications and Opportunities

The success of Mantis suggests several exciting pathways for future research and application:

  • Enhanced Real-World Applications: From automated systems in security domains that require analyzing multiple video feeds to medical diagnosis involving sequences of scans, Mantis’s capabilities could be transformative.
  • Methodological Shifts in AI Training: Mantis sets a precedent for using more focused, less resource-intensive training methods, which could be particularly beneficial for academic institutions and smaller labs.
  • Broader Accessibility: With its efficient use of resources and strong performance, Mantis opens up possibilities for more entities to experiment with and deploy advanced AI solutions.

Wrapping Up

As we step into an era where the integration of AI in processing complex, multi-image inputs becomes crucial, models like Mantis not only pave the way for more sophisticated applications but also highlight the shift towards more sustainable, effective AI training methodologies. The research behind Mantis illuminates a path forward where AI can be both powerful and within reach, a combination that will undoubtedly fuel the next wave of innovations in the field.
