Abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
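For readers who want to try the released weights, below is a hedged loading-and-inference sketch based on the linked repository's README. The checkpoint name internlm/internlm-xcomposer2-vl-7b and the model.chat signature are assumptions that may have changed; consult the repository for the current API.

```python
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# Checkpoint name assumed from the repo's README; check the repo for updates.
ckpt = 'internlm/internlm-xcomposer2-vl-7b'
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# '<ImageHere>' marks where the image is injected into the prompt.
query = '<ImageHere>Please describe this image in detail.'
response, _ = model.chat(tokenizer, query=query, image='example.jpg',
                         history=[], do_sample=False)
print(response)
```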

Interleaved composition created using InternLM-XComposer2.

Overview

  • The paper presents InternLM-XComposer2, an advanced Vision-Language Model (VLM) skilled in text-image composition and understanding.

  • Partial LoRA (P-LoRA) is introduced, applying additional LoRA parameters only to image tokens so that visual adaptation does not degrade the pre-trained language model, improving both comprehension and composition.

  • A carefully curated, complex, and diverse dataset underpins the model's performance, enabling it to handle a range of instructive and creative tasks.

  • InternLM-XComposer2 shows superior performance on various benchmarks, outperforming open-source models and rivaling top-tier models like GPT-4V and Gemini Pro.

  • The model's potential for multi-modal understanding heralds a new era in content generation and AI-assisted creative processes.

Introduction

InternLM-XComposer2, presented in the paper "InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Models," represents a significant advancement in the field of vision-language models (VLMs). It excels both at comprehending visual content and at composing interleaved text-image output, offering highly customizable content creation across a wide spectrum of application contexts.

Partial LoRA and Data Foundation

The model's capabilities rest on two critical design elements. The first is Partial LoRA (P-LoRA), which applies additional LoRA parameters exclusively to image tokens, preserving pre-trained language knowledge while harmonizing composition and comprehension (see the sketch below). The second is a high-quality, diverse data foundation: the training data is expertly curated, rich in complexity, and multifaceted, ranging from simple instruction following to content customization drawing on a wide variety of materials.
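To make the mechanism concrete, here is a minimal PyTorch sketch of the P-LoRA idea, assuming a per-token boolean image mask. PartialLoRALinear, im_mask, and the rank value are illustrative choices, not the paper's implementation; the official code lives in the linked repository.

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """A linear layer whose low-rank (LoRA) update is applied only at
    image-token positions; text tokens pass through the frozen base weight."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():       # keep pre-trained LLM weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # LoRA starts as a zero update

    def forward(self, x: torch.Tensor, im_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); im_mask: (batch, seq) bool, True = image token
        out = self.base(x)
        lora_out = self.lora_b(self.lora_a(x))  # LoRA scaling factor omitted for brevity
        # Add the low-rank update only where the mask marks image tokens.
        return out + lora_out * im_mask.unsqueeze(-1).to(out.dtype)

# Toy usage: the first four positions are image tokens, the rest are text.
layer = PartialLoRALinear(in_features=4096, out_features=4096, rank=8)
x = torch.randn(1, 10, 4096)
im_mask = torch.zeros(1, 10, dtype=torch.bool)
im_mask[:, :4] = True
y = layer(x, im_mask)                           # shape: (1, 10, 4096)
```

Because the update is zero-initialized and gated by the mask, text tokens initially behave exactly as in the frozen base model, which is how this design preserves pre-trained language knowledge while adapting to visual input.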

Performance Benchmarks and Advances

InternLM-XComposer2's performance across various benchmarks is noteworthy. It not only surpasses existing open-source multimodal LLMs (MLLMs) by a significant margin but also competes with advanced models such as GPT-4V and Gemini Pro, and it particularly excels at free-form text-image composition, as demonstrated on the OpenCompass evaluation of LLM creativity.

The Future of Vision-Language Understanding

The sophistication of InternLM-XComposer2, combined with robust methodologies such as Partial LoRA and a rich data foundation, holds promise for the future of multimodal understanding. Its proficiency in nuanced perception, intricate reasoning, and knowledge integration places it at the forefront of VLM advancements, with potential applications ranging from content generation to AI-augmented creative endeavors.
