InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
(arXiv: 2407.03320)
Abstract
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V-level capabilities with a mere 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
Overview
- InternLM-XComposer-2.5 (IXC-2.5) is an advanced Large Vision Language Model (LVLM) that highlights long-contextual input and output capabilities, facilitating sophisticated text-image comprehension and composition.
- The model introduces enhancements such as ultra-high-resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue to improve vision-language interaction.
- IXC-2.5 excels in webpage generation and high-quality text-image article composition, achieving state-of-the-art performance across various benchmarks and demonstrating practical applications such as translating visual designs into code and creating personal homepages from resumes.
This paper presents InternLM-XComposer-2.5 (IXC-2.5), a significant advancement in Large Vision Language Models (LVLMs) that emphasizes long-contextual input and output capabilities, enabling a range of sophisticated applications across text-image comprehension and composition. The model represents substantial progress over its predecessor, IXC-2.0, primarily due to its enhanced architecture and expanded capabilities.
Key Model Enhancements
IXC-2.5 incorporates three primary enhancements aimed at advancing vision-language comprehension:
- Ultra-High Resolution Understanding: Utilizing a 560 × 560 Vision Transformer (ViT) encoder, IXC-2.5 enables the processing of high-resolution images with various aspect ratios.
- Fine-Grained Video Understanding: Videos are treated as high-resolution composite images comprising numerous frames, capturing fine details via dense sampling of each frame.
- Multi-Turn Multi-Image Dialogue: The model supports extended, complex interactions involving multiple images over many turns, improving the fluidity of human-like conversations.
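The ultra-high-resolution path can be illustrated with a tiling sketch: an image of arbitrary aspect ratio is resized and cut into 560 × 560 crops for the ViT encoder. The 560 × 560 tile size comes from the paper; the grid-selection heuristic and the `max_tiles` cap below are illustrative assumptions, not the released implementation.

```python
import numpy as np

TILE = 560  # ViT input resolution used by IXC-2.5

def tile_grid(width: int, height: int, max_tiles: int = 24):
    """Pick a (cols, rows) grid of TILE-sized crops whose aspect ratio
    best matches the input image, without exceeding max_tiles crops."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def resize_target(width: int, height: int, max_tiles: int = 24):
    """Resolution the image is resized to before being cut into crops."""
    cols, rows = tile_grid(width, height, max_tiles)
    return cols * TILE, rows * TILE
```

A 1120 × 560 image, for example, maps to a 2 × 1 grid of crops, so the encoder sees two native-resolution tiles instead of one heavily downsampled image.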
Furthermore, IXC-2.5 extends its capabilities to two critical text-image composition applications:
- Crafting Webpages: Leveraging additional LoRA parameters, IXC-2.5 can generate source codes for webpages from text-image instructions.
- Composing High-Quality Text-Image Articles: Implementing Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques, IXC-2.5 produces high-quality written content with corresponding images.
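The extra LoRA parameters used by the composition tasks follow the standard recipe of a low-rank update on a frozen backbone weight. The sketch below is a minimal illustration; dimensions, rank, and scaling are assumptions, not the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8            # hidden size, LoRA rank, scaling (assumed)
W = rng.normal(size=(d, d))       # frozen backbone weight
A = rng.normal(size=(r, d))       # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection (zero-init)

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen path plus low-rank update; with B zero-initialized the
    adapter starts as an exact no-op on the pretrained model."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Because only A and B are trained, each composition task adds a small parameter set while the shared backbone stays intact.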
Training and Model Architecture
The IXC-2.5 model emphasizes long-contextual interaction: it is trained with 24K interleaved image-text contexts and can extend to 96K contexts via RoPE extrapolation. The architecture comprises an OpenAI ViT-L/14 vision encoder, the InternLM2-7B large language model, and Partial LoRA for aligning the vision encoder with the LLM.
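The long-context extension via RoPE extrapolation can be illustrated with a minimal rotary-embedding sketch. Increasing the rotary base (here 10000, the usual default) is one common extrapolation recipe; the exact scheme IXC-2.5 uses may differ.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0):
    """Per-position rotation angles; a larger base slows the rotation of
    low-frequency channels, letting positions beyond the training length
    stay distinguishable."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, inv_freq)               # (seq, dim/2)

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    seq, dim = x.shape
    ang = rope_angles(np.arange(seq), dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Since RoPE is a pure rotation, it preserves token-vector norms and encodes relative position in the angle between rotated queries and keys, which is what makes the train-short, infer-long extrapolation possible.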
The pre-training phase focuses on three tasks using a diverse dataset: General Semantic Alignment, World Knowledge Alignment, and Vision Capability Enhancement. This preparatory phase ensures the model's adeptness at processing varied vision-language inputs.
Benchmark Performance
IXC-2.5 demonstrates state-of-the-art performance across a wide range of benchmarks:
- Video Understanding: Outperformed existing models on four out of five benchmarks, including MVBench and MME-Video, demonstrating its proficiency in fine-grained video tasks.
- Structural High-Resolution Benchmarks: Achieved notable results on DocVQA, ChartQA, and TextVQA, demonstrating its capacity to handle complex visual information.
- General Visual QA Benchmarks: Excelled in MMStar, RealWorldQA, and others, showcasing its versatility.
- Multi-Image Multi-Turn Dialogue: Surpassed prior models in MMDU, highlighting advanced conversational abilities.
Webpage Generation and Article Composition
For webpage generation, IXC-2.5 supports:
- Screenshot-to-code: Achieved high scores on the Design2Code benchmark, demonstrating near GPT-4V-level performance in translating visual designs into code.
- Instruction-Aware Webpage Generation: Trained on synthetic and real-world datasets to convert textual instructions into webpage designs, including interactive JavaScript elements.
- Resume-to-homepage: Created personal homepages from resumes, showcasing practical applicability.
For article composition, the model follows a multi-step pipeline of supervised fine-tuning, reward modeling, preference data collection, and DPO alignment, resulting in stable, high-quality text-image articles.
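The DPO alignment step in this pipeline can be sketched with its standard objective: given log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") article under the policy and a frozen reference model, the loss pushes the policy to widen the margin. The `beta` value below is illustrative.

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """All arguments are summed log-probabilities of full sequences.
    Returns -log sigmoid of the beta-scaled preference margin."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the loss sits at log 2; it drops as the policy assigns relatively more probability to the preferred article, which is what stabilizes the composition quality after preference training.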
Implications and Future Directions
The enhancements and comprehensive capabilities of IXC-2.5 make it a robust tool for a variety of practical applications, from webpage design to intricate visual QA tasks. The model’s ability to handle long-contextual interactions positions it as a pivotal advancement for future AI developments. Future research can extend IXC-2.5’s long-context capabilities to more complex and extended multi-modal environments, such as continuous video streams or prolonged dialogue histories, thus broadening its applicability in real-world scenarios.