Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

(arXiv 2403.18814)
Published Mar 27, 2024 in cs.CV, cs.AI, and cs.CL

Abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE LLMs from 2B to 34B parameters. It is demonstrated to achieve leading performance on several zero-shot benchmarks and even to surpass well-developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.

Figure: Mini-Gemini's qualitative results showcase its advanced understanding of high-resolution images.

Overview

  • Mini-Gemini presents a new framework enhancing Vision Language Models (VLMs) through high-resolution visual tokens, improved data quality, and any-to-any workflow capabilities.

  • It introduces a dual vision encoder system for efficient high-resolution image processing without increased visual token count.

  • The framework uses a high-quality, diverse dataset for better image comprehension and generation, improving VLMs' understanding and creativity.

  • Extensive experiments show Mini-Gemini outperforms existing VLMs on zero-shot vision-language benchmarks, demonstrating strong multi-modal task handling.

Enhancing Vision Language Models with Mini-Gemini: A Dive into Multi-Modality, High-Resolution, and Data Quality

Overview of Mini-Gemini

Mini-Gemini introduces a novel framework aimed at enhancing the capabilities of Vision Language Models (VLMs) by focusing on three key areas: utilization of high-resolution visual tokens, improvement of data quality, and expansion of any-to-any workflow capabilities. By integrating an additional visual encoder, the framework refines high-resolution visual tokens without increasing their count, thereby optimizing computational efficiency. The construction of a high-quality dataset tailored for image comprehension and reasoning-based generation further broadens the operational capabilities of VLMs. Mini-Gemini demonstrates its effectiveness across several dense and Mixture of Experts (MoE) LLMs ranging from 2B to 34B parameters, setting new benchmarks in zero-shot vision tasks.

Technical Insights

Dual Vision Encoders and High-Resolution Image Processing

Mini-Gemini's architecture pairs two vision encoders that together enhance the quality and effective resolution of visual tokens. The low-resolution encoder produces the foundational visual embedding that is passed to the LLM, while the high-resolution encoder supplies detailed visual cues; a patch-level cross-attention step, which the paper calls patch info mining, lets each low-resolution token retrieve detail from its corresponding high-resolution region. This dual-encoder system, inspired by the Gemini constellation, processes high-resolution images efficiently without burdening the framework with excessive visual tokens.
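To make the fixed-token-count argument concrete, here is a minimal PyTorch sketch of the patch-info-mining idea: each low-resolution token acts as a query and attends only to the high-resolution features of its own image region, so the number of tokens handed to the LLM never grows. The module name, shapes, and dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class PatchInfoMining(nn.Module):
    """Sketch: refine each low-res token with its high-res region (assumed shapes)."""

    def __init__(self, lr_dim: int, hr_dim: int, out_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(lr_dim, out_dim)  # low-res tokens -> queries
        self.k_proj = nn.Linear(hr_dim, out_dim)  # high-res features -> keys
        self.v_proj = nn.Linear(hr_dim, out_dim)  # high-res features -> values

    def forward(self, lr_tokens, hr_regions):
        # lr_tokens:  (B, N, lr_dim)      one token per low-res patch
        # hr_regions: (B, N, M, hr_dim)   M high-res sub-patches per region
        q = self.q_proj(lr_tokens).unsqueeze(2)                 # (B, N, 1, D)
        k = self.k_proj(hr_regions)                             # (B, N, M, D)
        v = self.v_proj(hr_regions)                             # (B, N, M, D)
        attn = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5   # (B, N, 1, M)
        mined = (attn.softmax(dim=-1) @ v).squeeze(2)           # (B, N, D)
        return mined  # same token count as the low-res input


# Example: 576 low-res tokens, each refined from a 4-sub-patch region.
lr = torch.randn(1, 576, 1024)
hr = torch.randn(1, 576, 4, 1536)
print(PatchInfoMining(1024, 1536, 1024)(lr, hr).shape)  # torch.Size([1, 576, 1024])
```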

Enhanced Data Quality

The paper underscores the importance of high-quality data in improving the performance of VLMs. Mini-Gemini leverages a meticulously constructed dataset from various public sources, focusing on image comprehension, text and image generation, and reasoning. The inclusion of high-quality responses and task-oriented instructions significantly contributes to the model's enhanced understanding and generation capabilities.
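For illustration, a record in such an instruction-tuning set might resemble the following LLaVA-style conversation sample; the field names and file path are hypothetical, not the paper's released schema.

```python
# A hypothetical instruction-tuning record in the common LLaVA-style
# conversation format; all values below are illustrative assumptions.
sample = {
    "image": "images/000123.jpg",  # hypothetical path
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat text appears on the storefront sign?"},
        {"from": "gpt",
         "value": 'The sign reads "Mini Market", painted in red letters.'},
    ],
}
```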

Expanding VLM Functions

At the heart of Mini-Gemini is an any-to-any inference model that accepts image and text inputs and produces image and text outputs. This flexibility comes from the visual token enhancement pipeline combined with an off-the-shelf generative model: for image outputs, the VLM reasons over the request and emits a text prompt that is handed to a text-to-image diffusion model. The approach not only improves comprehension performance but also enables reasoning-based image and text generation.
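A minimal sketch of how such routing could work, assuming the VLM marks generation requests with a special tag; the `<gen>` tag and the `run_vlm` helper are hypothetical, while the paper pairs the VLM with SDXL as the text-to-image backend.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Text-to-image backend (the paper uses SDXL for VLM-guided generation).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")


def answer(user_input: str, run_vlm):
    """Route a request: plain text comes back as text, tagged replies become images."""
    reply = run_vlm(user_input)        # run_vlm: hypothetical call into the VLM
    if reply.startswith("<gen>"):      # VLM decided an image should be generated
        prompt = reply.removeprefix("<gen>").strip()
        return pipe(prompt).images[0]  # PIL image from the diffusion model
    return reply                       # ordinary text answer
```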

Empirical Validation and Performance

Extensive experiments demonstrate Mini-Gemini's strong performance across a range of zero-shot benchmarks. The framework consistently outperforms existing open models and even surpasses private models on complex benchmarks such as MMB (MMBench) and MMMU. These results highlight Mini-Gemini's capabilities on advanced multi-modal tasks and attest to its potential as a robust tool in the realm of VLMs.

Future Directions and Theoretical Implications

The introduction of Mini-Gemini opens new avenues for research in enhancing the performance and applicability of Vision Language Models. The framework's scalable architecture, combined with its focus on high-resolution visual tokens and high-quality data, sets a new standard for future developments in the field. The theoretical exploration of high-resolution image processing and data quality improvements provides valuable insights into the optimization of VLMs. As the community continues to push the boundaries of what's possible with generative AI, Mini-Gemini stands as a significant milestone in the journey towards fully realizing the potential of multi-modality in AI models.

Concluding Remarks

Mini-Gemini represents a significant advancement in the field of Vision Language Models, showcasing the vital role of high-resolution visual processing, quality data, and flexible workflow capabilities. Its exceptional performance across a breadth of benchmarks highlights the effectiveness of its novel approach. As the field moves forward, Mini-Gemini's contributions will undoubtedly serve as a foundation for further innovations, driving the evolution of VLMs towards new heights of capability and application.
