InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
(arXiv: 2407.03320)
Abstract
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V-level capabilities with a mere 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
Overview
- InternLM-XComposer-2.5 (IXC-2.5) is an advanced Large Vision Language Model (LVLM) that highlights long-contextual input and output capabilities, facilitating sophisticated text-image comprehension and composition.
- The model introduces enhancements such as ultra-high-resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue to improve vision-language interaction.
- IXC-2.5 excels in webpage generation and high-quality text-image article composition, achieving state-of-the-art performance across various benchmarks and demonstrating practical applications such as translating visual designs into code and creating personal homepages from resumes.
This paper presents InternLM-XComposer-2.5 (IXC-2.5), a significant advancement in Large Vision Language Models (LVLMs) that emphasizes long-contextual input and output capabilities, enabling a range of sophisticated applications across text-image comprehension and composition. The model represents substantial progress over its predecessor, IXC-2.0, primarily due to its enhanced architecture and expanded capabilities.
Key Model Enhancements
IXC-2.5 incorporates three primary enhancements aimed at advancing vision-language comprehension:
- Ultra-High Resolution Understanding: Utilizing a 560 × 560 Vision Transformer (ViT) encoder, IXC-2.5 enables the processing of high-resolution images with various aspect ratios.
- Fine-Grained Video Understanding: Videos are treated as high-resolution composite images comprising numerous frames, capturing fine details via dense sampling of each frame.
- Multi-Turn Multi-Image Dialogue: The model supports extended, complex interactions involving multiple images over many turns, improving the fluidity of human-like conversations.
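The ultra-high-resolution path can be illustrated with a tiling sketch: an image of arbitrary aspect ratio is resized and cut into 560 × 560 crops for the ViT encoder. The 560 × 560 tile size comes from the paper; the grid-selection heuristic and the `max_tiles` cap below are illustrative assumptions, not the released implementation.

```python
import numpy as np

TILE = 560  # ViT input resolution used by IXC-2.5

def tile_grid(width: int, height: int, max_tiles: int = 24):
    """Pick a (cols, rows) grid of TILE-sized crops whose aspect ratio
    best matches the input image, without exceeding max_tiles crops."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(cols / rows - width / height)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def resize_target(width: int, height: int, max_tiles: int = 24):
    """Resolution the image is resized to before being cut into crops."""
    cols, rows = tile_grid(width, height, max_tiles)
    return cols * TILE, rows * TILE
```

A 1120 × 560 image, for example, maps to a 2 × 1 grid of crops, so the encoder sees two native-resolution tiles instead of one heavily downsampled image.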
Furthermore, IXC-2.5 extends its capabilities to two critical text-image composition applications:
- Crafting Webpages: Leveraging additional LoRA parameters, IXC-2.5 can generate source codes for webpages from text-image instructions.
- Composing High-Quality Text-Image Articles: Implementing Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques, IXC-2.5 produces high-quality written content with corresponding images.
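The extra LoRA parameters used by the composition tasks follow the standard recipe of a low-rank update on a frozen backbone weight. The sketch below is a minimal illustration; dimensions, rank, and scaling are assumptions, not the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8            # hidden size, LoRA rank, scaling (assumed)
W = rng.normal(size=(d, d))       # frozen backbone weight
A = rng.normal(size=(r, d))       # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection (zero-init)

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen path plus low-rank update; with B zero-initialized the
    adapter starts as an exact no-op on the pretrained model."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Because only A and B are trained, each composition task adds a small parameter set while the shared backbone stays intact.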
Training and Model Architecture
The IXC-2.5 model emphasizes long-contextual interaction: it is trained with 24K interleaved image-text contexts and can extend to 96K contexts via RoPE extrapolation. The architecture comprises an OpenAI ViT-L/14 vision encoder, the InternLM2-7B large language model, and Partial LoRA for aligning the vision encoder with the LLM.
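The long-context extension via RoPE extrapolation can be illustrated with a minimal rotary-embedding sketch. Increasing the rotary base (here 10000, the usual default) is one common extrapolation recipe; the exact scheme IXC-2.5 uses may differ.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0):
    """Per-position rotation angles; a larger base slows the rotation of
    low-frequency channels, letting positions beyond the training length
    stay distinguishable."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, inv_freq)               # (seq, dim/2)

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    seq, dim = x.shape
    ang = rope_angles(np.arange(seq), dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Since RoPE is a pure rotation, it preserves token-vector norms and encodes relative position in the angle between rotated queries and keys, which is what makes the train-short, infer-long extrapolation possible.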
The pre-training phase focuses on three tasks using a diverse dataset: General Semantic Alignment, World Knowledge Alignment, and Vision Capability Enhancement. This preparatory phase ensures the model's adeptness at processing varied vision-language inputs.
Benchmark Performance
IXC-2.5 demonstrates state-of-the-art performance across a wide range of benchmarks:
- Video Understanding: Outperformed existing models on four out of five benchmarks, including MVBench and MME-Video, demonstrating its proficiency in fine-grained video tasks.
- Structural High-Resolution Benchmarks: Achieved notable results on DocVQA, ChartQA, and TextVQA, demonstrating its capacity to handle complex visual information.
- General Visual QA Benchmarks: Excelled in MMStar, RealWorldQA, and others, showcasing its versatility.
- Multi-Image Multi-Turn Dialogue: Surpassed prior models in MMDU, highlighting advanced conversational abilities.
Webpage Generation and Article Composition
For webpage generation, IXC-2.5 supports:
- Screenshot-to-code: Achieved high scores on the Design2Code benchmark, demonstrating near GPT-4V-level performance in translating visual designs into code.
- Instruction-Aware Webpage Generation: Trained on synthetic and real-world datasets to convert textual instructions into webpage designs, including interactive JavaScript elements.
- Resume-to-homepage: Created personal homepages from resumes, showcasing practical applicability.
For article composition, the model follows a multi-step pipeline of supervised fine-tuning, reward modeling, preference data collection, and DPO alignment, resulting in stable, high-quality text-image articles.
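The DPO alignment step in this pipeline can be sketched with its standard objective: given log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") article under the policy and a frozen reference model, the loss pushes the policy to widen the margin. The `beta` value below is illustrative.

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """All arguments are summed log-probabilities of full sequences.
    Returns -log sigmoid of the beta-scaled preference margin."""
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the loss sits at log 2; it drops as the policy assigns relatively more probability to the preferred article, which is what stabilizes the composition quality after preference training.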
Implications and Future Directions
The enhancements and comprehensive capabilities of IXC-2.5 make it a robust tool for a variety of practical applications, from webpage design to intricate visual QA tasks. The model’s ability to handle long-contextual interactions positions it as a pivotal advancement for future AI developments. Future research can extend IXC-2.5’s long-context capabilities to more complex and extended multi-modal environments, such as continuous video streams or prolonged dialogue histories, thus broadening its applicability in real-world scenarios.