Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2403.18814v1)
Abstract: In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE LLMs from 2B to 34B. It is demonstrated to achieve leading performance on several zero-shot benchmarks and even surpasses developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
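To make the "high-resolution refinement without increasing the visual token count" idea concrete, below is a minimal sketch, not the authors' implementation: it assumes a cross-attention interface in which the low-resolution visual tokens act as queries over features from a second, high-resolution encoder, so the token count fed to the LLM stays fixed. The class name HighResRefiner and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class HighResRefiner(nn.Module):
    """Hypothetical sketch: low-resolution visual tokens query high-resolution
    features via cross-attention, so the number of tokens passed to the LLM
    remains equal to the low-resolution count."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, lowres_tokens: torch.Tensor, highres_feats: torch.Tensor) -> torch.Tensor:
        # lowres_tokens: (B, N_low, dim), e.g. from a CLIP-style ViT encoder
        # highres_feats: (B, N_high, dim), e.g. from a second encoder, N_high >> N_low
        q = self.norm_q(lowres_tokens)
        kv = self.norm_kv(highres_feats)
        refined, _ = self.attn(q, kv, kv)
        # Residual connection preserves the original low-resolution semantics.
        return lowres_tokens + refined

if __name__ == "__main__":
    B, N_low, N_high, dim = 2, 576, 2304, 1024  # illustrative sizes only
    refiner = HighResRefiner(dim)
    low = torch.randn(B, N_low, dim)
    high = torch.randn(B, N_high, dim)
    out = refiner(low, high)
    print(out.shape)  # torch.Size([2, 576, 1024]) -- token count unchanged
```

The design point this illustrates is that the extra encoder only supplies keys and values; the query set, and hence the LLM's visual context length, is unchanged regardless of the high-resolution input size.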