Emergent Mind

MANTIS: Interleaved Multi-Image Instruction Tuning

(2405.01483)
Published May 2, 2024 in cs.CV , cs.AI , and cs.CL

Abstract

Recent years have witnessed a great array of large multimodal models (LMMs) that effectively solve single-image vision-language tasks. However, their ability to solve multi-image vision-language tasks remains limited. Existing multi-image LMMs (e.g., OpenFlamingo, Emu, Idefics) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text pairs from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, containing 721K instances from 14 multi-image datasets. We design Mantis-Instruct to cover different multi-image skills such as co-reference, reasoning, comparison, and temporal understanding. We combine Mantis-Instruct with several single-image vision-language datasets to train our model, Mantis, to handle any interleaved image-text input. We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks. Though requiring only academic-level resources (i.e., 36 hours on 16xA100-40G), Mantis-8B achieves state-of-the-art performance on all the multi-image benchmarks and beats the existing best multi-image LMM, Idefics2-8B, by an average of 9 absolute points. We observe that Mantis performs equally well on held-in and held-out evaluation benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that it maintains strong single-image performance on par with CogVLM and Emu2. Our results are particularly encouraging, as they show that low-cost instruction tuning is indeed much more effective than intensive pre-training for building multi-image LMMs.

Figure: the Mantis model's capabilities in understanding multiple images to complete specified tasks.

Overview

  • The paper introduces Mantis, an AI model tailored for enhanced multi-image understanding, tackling gaps in existing large multimodal models (LMMs).

  • Mantis employs instruction tuning on the Mantis-Instruct dataset, needing fewer resources for training and achieving superior performance across various benchmarks.

  • The model's innovative use of interleaved text-image input handling, efficient training, and its implications for future AI research and applications are discussed.

Examining Mantis: Enhancing AI's Multi-Image Understanding

Introduction to Mantis: A Leap in Large Multimodal Models

While the AI research community has made significant strides in developing models that effectively handle single-image inputs, multi-image tasks have remained relatively underserved. This discrepancy becomes apparent in real-world applications where understanding sequences or sets of images is crucial. The paper introduces a model named Mantis, designed specifically to address this gap. Let's examine its approach and how it compares with existing models.

What Sets Mantis Apart

Mantis targets direct improvements in handling multi-image scenarios by employing a method known as instruction tuning on a specially curated dataset dubbed Mantis-Instruct. This dataset boasts 721K instances covering a variety of multi-image tasks designed to bolster the model's capacity in co-reference, reasoning, comparison, and temporal understanding of visual data.

Here's what makes Mantis noteworthy:

  • Efficient Training: Unlike its predecessors, which relied on pre-training on vast amounts of data, Mantis achieves superior results using a fraction of the resources—just 36 hours on 16xA100-40G GPUs.
  • Strong Performance Metrics: Mantis not only outperforms existing multi-image LMMs on various benchmarks but does so by a noticeable margin, achieving state-of-the-art results and even rivaling models like GPT-4V in specific tasks.
  • Robust Generalization: Its performance is consistent across both 'held-in' and 'held-out' evaluation settings, evidencing strong generalization abilities.
  • Low Resource, High Yield: By demonstrating that low-cost instruction tuning is more effective than intensive pre-training, Mantis offers a more accessible model-building methodology that could democratize advanced AI research.

Under the Hood: How Mantis Achieves Its Edge

Mantis combines instruction tuning with a pre-trained language model and a visual transformer encoder, leveraging both textual and visual data. The underlying architecture ensures that the model can handle interleaved text-image inputs effectively, preparing it for complex real-world applications where such capabilities are indispensable.
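To make the idea of interleaved text-image input concrete, here is a minimal sketch (not the authors' code) of how a prompt containing image placeholders can be flattened into one model-ready sequence. The `<image>` placeholder convention, the stand-in tokenizer, and the stand-in vision encoder are all illustrative assumptions; a real system would use a learned tokenizer and a visual transformer encoder producing patch embeddings.

```python
IMAGE_TOKEN = "<image>"

def encode_text(text):
    """Stand-in tokenizer: one integer id per whitespace-separated token."""
    return [hash(tok) % 50000 for tok in text.split()]

def encode_image(image):
    """Stand-in vision encoder: returns a fixed number of 'patch' embeddings.
    A real encoder would return dense vectors; here we use tagged tuples."""
    return [("img_patch", image, i) for i in range(4)]  # 4 patches per image

def build_interleaved_sequence(prompt, images):
    """Replace each <image> placeholder in the prompt with the patch
    embeddings of the corresponding image, preserving left-to-right order."""
    segments = prompt.split(IMAGE_TOKEN)
    if len(segments) - 1 != len(images):
        raise ValueError("number of <image> placeholders must match images")
    sequence = []
    for i, segment in enumerate(segments):
        sequence.extend(("text", tid) for tid in encode_text(segment))
        if i < len(images):
            sequence.extend(("image", emb) for emb in encode_image(images[i]))
    return sequence

seq = build_interleaved_sequence(
    f"Compare {IMAGE_TOKEN} and {IMAGE_TOKEN} . Which is brighter?",
    ["photo_a.jpg", "photo_b.jpg"],
)
```

The key design point this sketch illustrates is that image embeddings are spliced into the token stream at their original positions, so the language model sees text and images in the order the user provided them rather than as a single image prepended to the text.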

Here are the key technical pillars that support Mantis:

  1. Diverse Data Handling: By training on varied datasets, each representing different skills, Mantis is not just learning to recognize images but to understand the context, differences, and temporal dynamics within them.
  2. Innovative Training Routine: Instead of the traditional massive pre-training routine, Mantis uses targeted instruction tuning, which makes it resource-efficient and quick to adapt to new types of data.
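To illustrate what a skill-targeted instruction-tuning instance might look like, here is a hypothetical record in the spirit of Mantis-Instruct. The field names and schema below are assumptions for illustration, not the dataset's actual format; the only structural property enforced is that image placeholders in the conversation line up with the attached images.

```python
import json

# Hypothetical multi-image instruction-tuning record (illustrative schema).
record = {
    "id": "coref-000001",
    "skill": "co-reference",  # other skills: reasoning, comparison, temporal
    "images": ["frame_01.png", "frame_02.png"],
    "conversation": [
        {"role": "user",
         "content": "In <image> and <image>, is the same person shown?"},
        {"role": "assistant",
         "content": "Yes, the person in the first image reappears in the second."},
    ],
}

def validate(rec):
    """Check that the number of <image> placeholders across the conversation
    matches the number of attached images."""
    n_placeholders = sum(turn["content"].count("<image>")
                         for turn in rec["conversation"])
    return n_placeholders == len(rec["images"])

assert validate(record)
print(json.dumps(record, indent=2))
```

Grouping instances by skill in this way is what lets a comparatively small, curated dataset teach distinct multi-image abilities, rather than hoping they emerge from web-scale noisy interleaved data.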

Future Implications and Opportunities

The success of Mantis suggests several exciting pathways for future research and application:

  • Enhanced Real-World Applications: From automated systems in security domains that require analyzing multiple video feeds to medical diagnosis involving sequences of scans, Mantis’s capabilities could be transformative.
  • Methodological Shifts in AI Training: Mantis sets a precedent for using more focused, less resource-intensive training methods, which could be particularly beneficial for academic institutions and smaller labs.
  • Broader Accessibility: With its efficient use of resources and strong performance, Mantis opens up possibilities for more entities to experiment with and deploy advanced AI solutions.

Wrapping Up

As we step into an era where the integration of AI in processing complex, multi-image inputs becomes crucial, models like Mantis not only pave the way for more sophisticated applications but also highlight the shift towards more sustainable, effective AI training methodologies. The research behind Mantis illuminates a path forward where AI can be both powerful and within reach, a combination that will undoubtedly fuel the next wave of innovations in the field.
