MIMIC-IT: Multi-Modal In-Context Instruction Tuning

Published 8 Jun 2023 in cs.CV, cs.AI, cs.CL, and cs.HC | (2306.05425v1)

Abstract: High-quality instructions and responses are essential for the zero-shot performance of LLMs on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs should be imperative to tune vision-LLMs (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs in terms of quantity, diversity, and creativity remains limited, posing challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed as Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Based on extensive evaluations conducted on vision-language benchmarks, it has been observed that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation reveals it effectively aligns with the user's intentions. We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.

Abstract PDF Upgrade to Chat

Authors (8)

Citations (190)

View on Semantic Scholar

Summary

The paper introduces a novel MIMIC-IT dataset with 2.8M multi-modal instruction pairs to enhance VLMs' zero-shot performance.
It details the Syphus toolchain that automates high-quality data creation, filling gaps in diversity and creativity for multi-modal instructions.
The paper demonstrates Otter's superior performance in perception, reasoning, and few-shot learning, underlining its potential for interactive AI applications.

The paper "MIMIC-IT: Multi-Modal In-Context Instruction Tuning" proposes a comprehensive approach to enhancing the zero-shot performance of vision-LLMs (VLMs) through a dataset called MIMIC-IT. This dataset consists of 2.8 million multi-modal instruction-response pairs, with 2.2 million unique instructions derived from images and videos, designed to enhance the capabilities of VLMs in perception, reasoning, and planning. The paper outlines the construction of this dataset, the training of a model named Otter using this dataset, and the evaluation of Otter's performance against existing benchmarks.

Dataset Construction

The MIMIC-IT dataset is engineered to fill the existing gaps in vision-language instruction datasets that suffer from limited quantity, diversity, and creativity, which constrain the generalization of interactive VLMs. The dataset's construction employs a toolchain named Syphus, which automates the annotation process by combining human expertise with the capabilities of GPT. Syphus is pivotal in generating high-quality instruction-response pairs, thus facilitating a scalable solution for data creation. One notable aspect is the use of multi-modal in-context information, enabling a richer conversational context that includes visual data, such as photos and videos. This holistic approach allows VLMs to better understand and process interactive tasks involving visual scenes.

Model Training and Evaluation

Utilizing the MIMIC-IT dataset, the researchers trained Otter, a large VLM. Otter's performance was exhaustively evaluated across various vision-language benchmarks, demonstrating significant prowess in multi-modal perception, reasoning, and in-context learning. In human evaluations, Otter was found to effectively align with user intentions, highlighting its potential as a practical conversational assistant.

The paper details two main areas of evaluation:

Perception and Reasoning: Otter was assessed using a set of benchmarks to measure its ability to understand and reason about visual content. The results showcased Otter's superior performance compared to existing VLMs, achieving high accuracy in tasks that involved complex scenes and narrative comprehension.
In-Context Learning: The model showcased robust capabilities in few-shot learning scenarios, outperforming its predecessors in tasks that required understanding new instructions based on minimal examples. This ability highlights Otter's potential to adapt to novel tasks with limited supervision.

Implications and Future Directions

The introduction of the MIMIC-IT dataset and the development of Otter mark a significant step in the advancement of multi-modal AI systems. The dataset's design principles focus on providing diverse and context-rich instruction sets, addressing critical gaps in current datasets. The practicality of Otter in real-world applications, such as enhanced AR headset functionalities and more intuitive human-AI interaction, is a promising development.

Speculating on future advancements, this work suggests a trajectory toward more adaptive and versatile AI systems that can efficiently process and interpret both language and visual information. The research sets the groundwork for leveraging diverse data types to train models capable of generalized understanding, which could be invaluable across sectors like autonomous systems, interactive media, and assistive technologies.

Conclusively, this work is a notable contribution to the field, presenting methodologies and datasets that are likely to drive further innovations in VLMs. The release of MIMIC-IT and associated tools is poised to be a valuable resource for the community, facilitating new research avenues in multi-modal AI.

Markdown Report Issue