MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Published 14 Oct 2023 in cs.CV | (2310.09478v3)

Abstract: LLMs have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at https://minigpt-v2.github.io/

Abstract PDF Upgrade to Chat

Authors (10)

Citations (360)

View on Semantic Scholar

Summary

The paper introduces MiniGPT-v2, a unified vision-language model that uses a visual backbone paired with LLaMA2-chat to boost performance on tasks like VQA and image description.
The model’s three-stage training process, including pretraining, multi-task training, and multimodal instruction tuning, leverages high-resolution images and fine-grained datasets to optimize performance and reduce hallucination.
Its innovative use of task identifiers and efficient token aggregation paves the way for more integrated and robust AI systems in complex vision-language applications.

Overview of MiniGPT-v2: A Unified Interface for Vision-Language Tasks

The paper "MiniGPT-v2: LLM As a Unified Interface for Vision-Language Multi-task Learning" presents a novel approach for integrating vision and language processing tasks using a single model. This work addresses the complexities inherent in performing diverse vision-language tasks such as image description, visual question answering (VQA), and visual grounding, using a unified framework.

Model Architecture

MiniGPT-v2 operates by utilizing a unique task identifier system, allowing the model to distinguish between different vision-language tasks efficiently. The architecture comprises a visual backbone, a linear projection layer, and a LLM, specifically adopting LLaMA2-chat (7B). A critical feature is the aggregation of visual tokens, which optimizes computational efficiency by condensing 75% of token input length. The model is trained using high-resolution images (448x448), enhancing visual perception capabilities.

Training Strategy

The model undergoes a three-stage training process:

Pretraining: Initial exposure to both weakly-labeled and fine-grained datasets aims to build a broad vision-language knowledge base.
Multi-task Training: This stage focuses exclusively on fine-grained datasets to refine task performance, ensuring more precise task execution across various vision-language tasks.
Multi-modal Instruction Tuning: The model is trained with specific instruction datasets to enhance its ability to follow multi-modal instructions effectively, integrating both image and language datasets.

Experimental Results

The experiments highlight MiniGPT-v2's robust performance across various tasks compared to other multi-modal models:

On visual question answering, it achieved top-tier accuracy, outperforming models like InstructBLIP and MiniGPT-4 on several benchmarks like VSR and OKVQA.
In referring expression comprehension tasks, MiniGPT-v2 set new performance standards among generalist models, although not yet exceeding specialist models.
The model showcases reduced hallucination compared to baseline models, achieving low scores on CHAIR metrics when generating detailed image descriptions.

Implications and Future Work

MiniGPT-v2 demonstrates significant advancements in unifying vision-language tasks in a single model interface, paving the way for more integrated approaches in artificial intelligence. The model's use of high-resolution images and task-specific identifiers enhances its adaptability to diverse tasks, suggesting a powerful tool for developing visual AI assistants and chatbots.

Future developments could focus on integrating stronger vision backbones, exploring larger LLM integrations, and minimizing hallucination in image-to-text tasks. Expanding the variety of datasets could further enhance model robustness, offering new prospects for applications in complex real-world scenarios.

In conclusion, MiniGPT-v2 represents a critical step forward in vision-LLM development, offering a unified framework that handles multiple tasks with notable efficacy and efficiency.

Markdown Report Issue