KOSMOS-2.5: A Multimodal Literate Model

Published 20 Sep 2023 in cs.CL and cs.CV | (2309.11419v2)

Abstract: The automatic reading of text-intensive images represents a significant advancement toward achieving AGI. In this paper we present KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on a large-scale corpus of text-intensive images, KOSMOS-2.5 excels in two distinct yet complementary transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned spatial coordinates within the image, and (2) producing structured text output that captures both style and structure in markdown format. This unified multimodal literate capability is achieved through a shared decoder-only autoregressive Transformer architecture and task-specific prompts. Building on this foundation, we fine-tune KOSMOS-2.5 for document understanding tasks, resulting in a document understanding generalist named KOSMOS-2.5-CHAT. Additionally, a large corpus of 357.4 million document pages spanning diverse domains was curated for pre-training. We evaluate KOSMOS-2.5 on two newly proposed benchmarks, OCREval and MarkdownEval, for document-level text recognition and image-to-markdown generation, demonstrating impressive literate capabilities comparable to GPT-4o. KOSMOS-2.5-CHAT achieves performance comparable to other state-of-the-art generalists that are five times larger (1.3B vs. 7B) across nine text-rich visual question answering benchmarks. Models and code have been available at https://aka.ms/kosmos25.


Authors (16)


Citations (47)


Summary

The paper introduces Kosmos-2.5, a unified decoder-only model that integrates spatially-aware text block and Markdown generation.
It employs a shared Transformer, Vision Transformer encoder, and resampler to efficiently process diverse text-rich image inputs.
On FUNSD, SROIE, and CORD the model reports higher F1, precision, and recall than commercial OCR solutions, and on image-to-Markdown generation it outperforms Nougat on NED and NTED.

Kosmos-2.5: A Unified Approach for Text-Intensive Image Understanding

The paper introduces Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. It builds on the Kosmos-2 architecture and uses a unified decoder-only structure to perform two transcription tasks: generating spatially-aware text blocks and generating structured text in Markdown format. Because both tasks share one Transformer and are selected by task-specific prompts, the same model can transcribe diverse document types while capturing both textual content and layout structure.

Key Contributions and Architecture

Kosmos-2.5 integrates both transcription tasks within a single model. Where text-image understanding systems have typically relied on encoder-decoder designs, Kosmos-2.5 uses a decoder-only format, which gives every task the same prompt-driven interface and makes the model simpler to combine with LLM-based pipelines. The model pairs a Vision Transformer (ViT) vision encoder with a Transformer-based language decoder. A resampler module sits between the two and condenses the image embeddings before they reach the decoder.
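The role of the resampler can be illustrated with a minimal sketch. The summary does not describe the resampler's internals, so the code below assumes one common design, attention pooling with a fixed set of latent queries, and omits the learned projections a real model would have. The names `resample`, `latent_queries`, and `random_vectors` are hypothetical, and the vectors are random stand-ins for ViT patch embeddings.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(patch_embeddings, latent_queries):
    """Pool a variable number of patch embeddings into len(latent_queries) vectors.

    Assumed design (not confirmed by the source): each latent query attends
    over all patches with scaled dot-product attention and returns the
    attention-weighted average of the patch embeddings.
    """
    dim = len(latent_queries[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for query in latent_queries:
        scores = [
            scale * sum(q * p for q, p in zip(query, patch))
            for patch in patch_embeddings
        ]
        weights = softmax(scores)
        pooled.append([
            sum(w * patch[i] for w, patch in zip(weights, patch_embeddings))
            for i in range(dim)
        ])
    return pooled


rng = random.Random(0)


def random_vectors(count, dim):
    """Random stand-ins for embeddings; real ones would come from the ViT."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


latent_queries = random_vectors(4, 8)   # fixed-size set of queries
small_page = random_vectors(30, 8)      # few image patches
large_page = random_vectors(500, 8)     # many image patches

small_out = resample(small_page, latent_queries)
large_out = resample(large_page, latent_queries)
```

Both `small_out` and `large_out` contain four vectors, so under this assumed design the decoder sees the same number of image tokens regardless of how many patches the page produced.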

The dual-task training strategy broadens Kosmos-2.5's coverage of text-rich image understanding tasks. The model is pretrained on a corpus spanning many document types and formats, including scanned documents, academic papers, presentation slides, PDFs, and webpages. As a result, a single model can produce either of two output formats: layout-oriented text aligned with bounding boxes, or structured text in Markdown.
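To make the bounding-box output format concrete, here is a small parsing sketch. The summary does not specify how text blocks are serialized, so the `<bbox><x_..><y_..><x_..><y_..></bbox>text` line format is an assumption made for illustration, and `TextBlock`, `parse_blocks`, and `reading_order` are hypothetical helpers rather than part of any released API.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBlock:
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


# Assumed serialization: one block per line, coordinates before the text.
BLOCK_PATTERN = re.compile(
    r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>([^\n]*)"
)


def parse_blocks(model_output):
    """Turn the assumed spatially-aware output string into TextBlock records."""
    return [
        TextBlock(int(x0), int(y0), int(x1), int(y1), text.strip())
        for x0, y0, x1, y1, text in BLOCK_PATTERN.findall(model_output)
    ]


def reading_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right, by their top-left corner."""
    return sorted(blocks, key=lambda block: (block.y0, block.x0))


sample_output = (
    "<bbox><x_10><y_40><x_90><y_60></bbox>Total: 12.50\n"
    "<bbox><x_10><y_5><x_120><y_25></bbox>Receipt\n"
)
blocks = parse_blocks(sample_output)
ordered = reading_order(blocks)
```

Keeping coordinates attached to each block lets downstream code re-sort, crop, or filter text by position. The Markdown output needs no such parsing because its structure is carried by the Markdown syntax itself.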

Experimental Evaluation

Kosmos-2.5 was evaluated on several text recognition datasets, including FUNSD, SROIE, and CORD, where it achieved higher F1 scores than existing commercial OCR solutions. Its precision and recall on these datasets indicate that it recognizes and transcribes text accurately even in intricate document layouts.
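As a rough illustration of how precision, recall, and F1 relate for text recognition, the sketch below scores a transcription against a reference using bag-of-words matching. The matching rule (whitespace tokenization, order ignored) is an assumption. The summary does not describe the paper's exact evaluation protocol, and `word_prf` is a hypothetical helper.

```python
from collections import Counter


def word_prf(predicted, reference):
    """Word-level precision, recall, and F1 using multiset overlap.

    Assumption: words are whitespace-separated and matched regardless of
    position; a repeated word counts only as often as it appears in both texts.
    """
    pred_counts = Counter(predicted.split())
    ref_counts = Counter(reference.split())
    matched = sum((pred_counts & ref_counts).values())
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(ref_counts.values()) if ref_counts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Three of four predicted words are correct; three of five reference words are found.
precision, recall, f1 = word_prf(
    "Invoice total 42 USD",
    "Invoice total 42 EUR due",
)
# precision == 0.75, recall == 0.6
```

Precision penalizes hallucinated or misread words, recall penalizes missed words, and F1 is their harmonic mean, which is why a single F1 figure is often used to compare OCR systems.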

Kosmos-2.5 also performed strongly on the image-to-Markdown generation task, significantly outperforming contemporary models such as Nougat. Two specialized metrics were used: Normalized Edit Distance (NED) to assess lexical fidelity and Normalized Tree Edit Distance (NTED) to assess the structural fidelity of the generated Markdown. Across the evaluated datasets, the results indicate that Kosmos-2.5 interprets document layouts accurately and produces high-fidelity text output, supporting its use in real-world document processing.
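A minimal sketch of a normalized edit distance follows, assuming character-level Levenshtein distance divided by the length of the longer string. The summary does not state the paper's exact normalization, nor whether scores are reported as a distance or as a similarity such as one minus the distance, so treat both choices as assumptions. NTED, which compares tree structures, is not sketched here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                 # delete char_a
                current[j - 1] + 1,              # insert char_b
                previous[j - 1] + substitution,  # match or substitute
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted, reference):
    """Edit distance scaled to [0, 1] by the longer string's length (assumed)."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0
    return edit_distance(predicted, reference) / longest


generated = "# Title\n\nSome *text*"
expected = "# Title\n\nSome **text**"
score = normalized_edit_distance(generated, expected)
```

Under this normalization, 0.0 means the generated Markdown matches the reference exactly and 1.0 means no characters could be aligned. Here the two strings differ by two inserted asterisks, so `score` is 2 divided by the length of `expected`.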

Implications and Future Directions

Kosmos-2.5 shows the potential of a unified, task-agnostic approach to multimodal literate models for text-intensive image understanding. Its simple, adaptable architecture suggests room to scale to further multimodal applications, especially given its proficiency in few-shot and zero-shot scenarios. Future work could give the model fine-grained control over document element positions through natural language instructions and extend it to multi-page document contexts.

Kosmos-2.5 also provides a foundation for coupling with more powerful LLMs, which could strengthen its contextual understanding and broaden its application to other AI tasks. Progress in these directions would support next-generation models that interpret and generate human-readable content efficiently from diverse data sources.
