
Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) intended to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities and making it transferable and reusable across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images and annotated it with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

InternVL 1.5 offers a strong visual representation, flexible input resolution, and bilingual English-Chinese proficiency, making it competitive among MLLMs.

Overview

  • InternVL 1.5 is an open-source multimodal large language model (MLLM) that aims to rival commercial proprietary models by combining a strong vision encoder, a dynamic high-resolution strategy, and a comprehensive bilingual dataset.

  • The model demonstrates strong performance across a range of benchmarks, particularly OCR-related tasks, where it outperforms prominent models such as Grok-1.5V and GPT-4V, showcasing the effectiveness of its new components and dataset.

  • Planned enhancements for InternVL 1.5 include expanded multilingual capabilities, improved processing of more varied document types, and richer interactive features for multimodal understanding tasks.

InternVL 1.5: Bridging the Gap in Multimodal Understanding between Open-Source and Proprietary Models

Overview

The report introduces InternVL 1.5, an upgraded open-source multimodal large language model (MLLM). The model incorporates three enhancements designed to close the capability gap between open-source and commercial proprietary models. By combining a strong vision encoder, a dynamic high-resolution strategy, and a high-quality bilingual dataset, InternVL 1.5 aims to deliver robust performance across a variety of multimodal understanding tasks.

Key Improvements

  • Strong Vision Encoder: A continuous learning strategy is applied to InternViT-6B, the large-scale vision foundation model at the core of InternVL 1.5, boosting its visual understanding and making it transferable and reusable across different LLMs.
  • Dynamic High-Resolution Strategy: InternVL 1.5 splits each image into 1 to 40 tiles of 448×448 pixels based on its resolution and aspect ratio, supporting inputs up to 4K resolution. This flexibility improves performance on detailed scenes and document images; a minimal sketch of the tiling step follows this list.
  • High-Quality Bilingual Dataset: The dataset covers a diverse range of natural scenes, document images, and conversations, annotated with English and Chinese question-answer pairs. It enriches the model's training data and markedly improves performance on OCR- and Chinese-related tasks.
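
To make the tiling step concrete, here is a minimal Python sketch of one way such a grid selection could work. The 448×448 tile size and the 1-to-40 tile budget come from the paper; the helper names pick_grid and tile_image and the tie-breaking rule are illustrative assumptions, not the released implementation.

```python
from PIL import Image

TILE = 448       # tile side length used by InternVL 1.5
MAX_TILES = 40   # upper bound on tiles per image

def pick_grid(width, height, max_tiles=MAX_TILES):
    """Choose a (cols, rows) grid whose aspect ratio best matches the image.

    Illustrative heuristic only; ties are broken in favour of more tiles.
    """
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    return min(candidates,
               key=lambda cr: (abs(cr[0] / cr[1] - aspect), -(cr[0] * cr[1])))

def tile_image(img):
    """Resize the image to the chosen grid and cut it into 448x448 tiles."""
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

# A 4K frame (3840x2160) is mapped to a wide grid of 448x448 tiles.
tiles = tile_image(Image.new("RGB", (3840, 2160)))
print(len(tiles), "tiles")
```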

Performance Analysis

InternVL 1.5 shows strong results in benchmarks against both open-source and proprietary models, achieving state-of-the-art results in 8 of the 18 evaluated benchmarks. Notably, it outperforms leading proprietary models such as Grok-1.5V and GPT-4V on several OCR-related benchmarks.

Theoretical and Practical Implications

Theoretical Advancements:

  • The integration of a strong vision encoder exemplifies advancements in continuous learning strategies that refine a model’s adaptability and enhance its performance over a diverse set of visual inputs.
  • The dynamic high-resolution approach demonstrates an effective way to handle widely varying image resolutions and aspect ratios with a fixed-size vision encoder, informing further research on resolution-adaptive image processing.

Practical Implications:

  • The bilingual (English-Chinese) training data opens avenues for real-world applications that must handle multilingual content proficiently.
  • Higher accuracy on OCR-related benchmarks translates directly into practical text extraction from documents and images, enabling robust applications such as automated document processing and content management systems; a hedged usage sketch follows below.
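
As a concrete illustration of that use case, the sketch below queries the released InternVL-Chat-V1-5 checkpoint about a scanned document through Hugging Face Transformers. The preprocessing is deliberately simplified to a single 448×448 view (the real pipeline uses the dynamic tiling described above), and the chat() call mirrors the usage shown on the model card; treat the exact call signature, the file name invoice.png, and the prompt as illustrative assumptions.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL-Chat-V1-5"  # released checkpoint

# trust_remote_code loads the model-specific code that defines chat();
# the call pattern below follows the model card and is an assumption here.
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Simplified preprocessing: a single 448x448 view with ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])
image = Image.open("invoice.png").convert("RGB")  # hypothetical document scan
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nExtract the total amount and the invoice date."
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=256, do_sample=False))
print(response)
```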

Future Directions

The ongoing development of InternVL 1.5, and its role in narrowing the performance gap, points to several potential enhancements. These include expanding multilingual capabilities to more languages and dialects, refining image processing to handle more complex and varied document types, and improving the model's interactive capabilities for more refined multimodal interactions. As the field evolves, more robust and generalized training approaches could also be developed to handle an even broader spectrum of multimodal tasks.

In conclusion, InternVL 1.5 represents a significant advance in the realm of open-source MLLMs, setting a benchmark for future developments in AI-based multimodal understanding systems.
