
Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) intended to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities and making it transferable and reusable across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images and annotated it with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

InternVL 1.5 offers a strong visual representation, flexible input resolution, and bilingual English-Chinese proficiency, making it competitive among MLLMs.

Overview

  • InternVL 1.5 is an open-source multimodal large language model (MLLM) that aims to rival commercial proprietary models by combining a strong vision encoder, a dynamic high-resolution strategy, and a comprehensive bilingual dataset.

  • The model demonstrates strong performance across a range of benchmarks, particularly OCR-related tasks, where it outperforms prominent models such as Grok-1.5V and GPT-4V, showcasing the effectiveness of its new components and dataset.

  • Planned enhancements for InternVL 1.5 include expanded multilingual capabilities, improved processing of more varied document types, and richer interactive features for multimodal understanding tasks.

InternVL 1.5: Bridging the Gap in Multimodal Understanding between Open-Source and Proprietary Models

Overview

The report introduces InternVL 1.5, an upgraded open-source multimodal large language model (MLLM). The model incorporates three enhancements designed to close the capability gap between open-source and commercial proprietary models. By combining a strong vision encoder, a dynamic high-resolution strategy, and a high-quality bilingual dataset, InternVL 1.5 aims to deliver robust performance across a variety of multimodal understanding tasks.

Key Improvements

  • Strong Vision Encoder: A continuous learning strategy is applied to InternViT-6B, the large-scale vision foundation model at the core of InternVL 1.5, boosting its visual understanding and making it transferable and reusable across different LLMs.
  • Dynamic High-Resolution Strategy: InternVL 1.5 splits each image into 1 to 40 tiles of 448×448 pixels based on its resolution and aspect ratio, supporting inputs up to 4K resolution. This flexibility improves performance on detailed scenes and document images; a minimal sketch of the tiling step follows this list.
  • High-Quality Bilingual Dataset: The dataset covers a diverse range of natural scenes, document images, and conversations, annotated with English and Chinese question-answer pairs. It enriches the model's training data and markedly improves performance on OCR- and Chinese-related tasks.
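
To make the tiling step concrete, here is a minimal Python sketch of one way such a grid selection could work. The 448×448 tile size and the 1-to-40 tile budget come from the paper; the helper names pick_grid and tile_image and the tie-breaking rule are illustrative assumptions, not the released implementation.

```python
from PIL import Image

TILE = 448       # tile side length used by InternVL 1.5
MAX_TILES = 40   # upper bound on tiles per image

def pick_grid(width, height, max_tiles=MAX_TILES):
    """Choose a (cols, rows) grid whose aspect ratio best matches the image.

    Illustrative heuristic only; ties are broken in favour of more tiles.
    """
    aspect = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    return min(candidates,
               key=lambda cr: (abs(cr[0] / cr[1] - aspect), -(cr[0] * cr[1])))

def tile_image(img):
    """Resize the image to the chosen grid and cut it into 448x448 tiles."""
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

# A 4K frame (3840x2160) is mapped to a wide grid of 448x448 tiles.
tiles = tile_image(Image.new("RGB", (3840, 2160)))
print(len(tiles), "tiles")
```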

Performance Analysis

InternVL 1.5 shows strong results in benchmarks against both open-source and proprietary models, achieving state-of-the-art results in 8 of the 18 evaluated benchmarks. Notably, it outperforms leading proprietary models such as Grok-1.5V and GPT-4V on several OCR-related benchmarks.

Theoretical and Practical Implications

Theoretical Advancements:

  • The integration of a strong vision encoder exemplifies advancements in continuous learning strategies that refine a model’s adaptability and enhance its performance over a diverse set of visual inputs.
  • The dynamic high-resolution approach demonstrates an effective way to handle widely varying image resolutions and aspect ratios with a fixed-size vision encoder, informing further research on resolution-adaptive image processing.

Practical Implications:

  • The bilingual (English-Chinese) training data opens avenues for real-world applications that must handle multilingual content proficiently.
  • Higher accuracy on OCR-related benchmarks translates directly into practical text extraction from documents and images, enabling robust applications such as automated document processing and content management systems; a hedged usage sketch follows below.
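
As a concrete illustration of that use case, the sketch below queries the released InternVL-Chat-V1-5 checkpoint about a scanned document through Hugging Face Transformers. The preprocessing is deliberately simplified to a single 448×448 view (the real pipeline uses the dynamic tiling described above), and the chat() call mirrors the usage shown on the model card; treat the exact call signature, the file name invoice.png, and the prompt as illustrative assumptions.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/InternVL-Chat-V1-5"  # released checkpoint

# trust_remote_code loads the model-specific code that defines chat();
# the call pattern below follows the model card and is an assumption here.
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Simplified preprocessing: a single 448x448 view with ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])
image = Image.open("invoice.png").convert("RGB")  # hypothetical document scan
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nExtract the total amount and the invoice date."
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=256, do_sample=False))
print(response)
```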

Future Directions

The ongoing development of InternVL 1.5, and its role in narrowing the performance gap, points to several potential enhancements. These include expanding multilingual capabilities to more languages and dialects, refining image processing to handle more complex and varied document types, and improving the model's interactive capabilities for more refined multimodal interactions. As the field evolves, more robust and generalized training approaches could also be developed to handle an even broader spectrum of multimodal tasks.

In conclusion, InternVL 1.5 represents a significant advance in the realm of open-source MLLMs, setting a benchmark for future developments in AI-based multimodal understanding systems.
