
Abstract

Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal LLMs (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.

Figure: Unified Structure Learning illustrated with DocOwl 1.5.

Overview

  • This paper introduces Unified Structure Learning and DocOwl 1.5, a Multimodal LLM (MLLM) that understands text-rich images without relying on OCR.

  • Unified Structure Learning involves structure-aware parsing and multi-grained text localization for diverse document types, advancing beyond typical visual text recognition.

  • DocOwl 1.5 features a novel vision-to-text module, H-Reducer, that maintains layout information in high-resolution images and leverages new datasets, DocStruct4M and DocReason25K, tailored to structure learning and reasoning.

  • DocOwl 1.5 outperforms existing models on visual document understanding benchmarks, demonstrating the potential of OCR-free MLLMs for document comprehension.

Unified Structure Learning for OCR-free Document Understanding with DocOwl 1.5

Introduction to Unified Structure Learning

In the quest to enhance the capabilities of Multimodal LLMs (MLLMs) in understanding text-rich document images without relying on Optical Character Recognition (OCR), this paper introduces Unified Structure Learning and presents DocOwl 1.5, a model that significantly improves upon the state-of-the-art. The principal innovation lies in the comprehensive approach to encoding structure information across different types of text-rich images, including documents, tables, charts, webpages, and natural images. Traditional MLLMs struggle with such images due to their reliance on visual encoders trained predominantly on natural image-text pairs, which do not optimally represent the textual and structural intricacies of document images.

Key Contributions

The contributions of this work are manifold:

  • Introduction of Unified Structure Learning which comprises structure-aware parsing tasks and multi-grained text localization tasks, covering a broad spectrum of document types.
  • Design of a highly effective vision-to-text module, termed H-Reducer, which efficiently processes high-resolution images while preserving vital layout information.
  • Construction of a novel dataset, DocStruct4M, specifically designed to facilitate Unified Structure Learning, alongside a reasoning tuning dataset, DocReason25K, aimed at eliciting the model's detailed explanation capabilities.
  • Demonstrated superiority of DocOwl 1.5 over existing models, achieving significant performance gains on 10 benchmark visual document understanding tasks.

The Innovation of Unified Structure Learning

Unified Structure Learning is at the heart of DocOwl 1.5's advancements. Rather than stopping at text recognition, it targets the structure within text-rich images through structure-aware parsing and multi-grained text localization across diverse domains. For structure-aware parsing, the model learns to transcribe documents, webpages, tables, charts, and natural images into structure-aware text sequences, using line feeds and spaces to convey layout and extended Markdown syntax to represent complex structures such as tables and charts. In doing so, it pushes the model's comprehension of documents beyond mere text recognition toward an understanding of layout.
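
As a toy illustration of what such a parsing target might look like, the snippet below serializes a small table into plain Markdown. The table contents are placeholders, and the paper's extended Markdown syntax is not reproduced here; the point is simply that the output sequence itself carries the row/column structure of the image.

```python
# Toy example of a structure-aware parsing target for a table image.
# The contents are hypothetical placeholders; the key idea is that the
# serialized text encodes the table's structure, not just its words.
header = ["Region", "Q1", "Q2"]
rows = [
    ["North", "120", "135"],
    ["South", "98", "110"],
]

def table_to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in rows)
    return "\n".join(lines)

print(table_to_markdown(header, rows))
```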

The multi-grained text localization tasks sharpen the model's ability to correlate texts with their spatial positions within images, both by recognizing the text inside a given region and by grounding a given text to its bounding box. Together, parsing and localization bridge text recognition and structural understanding, equipping the model to tackle a wide array of visual document understanding tasks.
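
The sketch below shows what a pair of localization samples could look like. The field names, prompt wording, and the <bbox> coordinate convention are illustrative assumptions, not the dataset's exact format.

```python
# Hypothetical samples for the two directions of multi-grained text
# localization; coordinates and field names are illustrative only.
recognition_sample = {
    "image": "doc_00042.png",
    "task": "text_recognition",   # given a box, read the text inside it
    "question": "Recognize the text within <bbox>112, 80, 640, 118</bbox>.",
    "answer": "Quarterly revenue by region",
}
grounding_sample = {
    "image": "doc_00042.png",
    "task": "text_grounding",     # given a phrase, localize it in the image
    "question": "Locate the phrase 'Quarterly revenue by region'.",
    "answer": "<bbox>112, 80, 640, 118</bbox>",
}
```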

Architectural Advancements

DocOwl 1.5 leverages H-Reducer, a vision-to-text module crafted to balance efficiency with the retention of the spatial and layout information critical for high-resolution document image processing. Unlike vision-to-text modules that either produce overly long visual feature sequences or sacrifice spatial fidelity, H-Reducer employs convolution to aggregate horizontally adjacent visual features. This substantially shortens the visual feature sequence while preserving the relative positional relationships essential for accurately interpreting text-rich documents.
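
A minimal PyTorch sketch of this idea follows. The feature dimension (1024), LLM embedding size (4096), 4-to-1 horizontal merge ratio, and module structure are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of an H-Reducer-style vision-to-text module.
import torch
import torch.nn as nn

class HReducer(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, merge=4):
        super().__init__()
        # 1 x merge convolution: keeps rows intact while fusing `merge`
        # horizontally adjacent patch features, shrinking the sequence by `merge`x.
        self.conv = nn.Conv2d(vis_dim, vis_dim, kernel_size=(1, merge), stride=(1, merge))
        # Linear projection into the LLM embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, feats, h, w):
        # feats: (batch, h*w, vis_dim) patch features from the vision encoder.
        b, n, c = feats.shape
        x = feats.transpose(1, 2).reshape(b, c, h, w)   # restore the 2D patch grid
        x = self.conv(x)                                # (b, c, h, w // merge)
        x = x.flatten(2).transpose(1, 2)                # back to a token sequence
        return self.proj(x)                             # (b, h * (w // merge), llm_dim)

# Example: a 448x448 crop with 14x14 patches yields a 32x32 grid (1024 tokens);
# merging 4 horizontal neighbors leaves 256 visual tokens per crop.
reducer = HReducer()
tokens = reducer(torch.randn(1, 32 * 32, 1024), h=32, w=32)
print(tokens.shape)  # torch.Size([1, 256, 4096])
```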

Comprehensive Dataset Construction

The creation of the DocStruct4M and DocReason25K datasets is a key step toward training and evaluating models for OCR-free document understanding. DocStruct4M supports Unified Structure Learning by offering a rich compilation of structure-aware text sequences and multi-grained pairs of texts and bounding boxes, spanning varied document types. In parallel, DocReason25K helps refine the model's ability to generate detailed explanations by providing high-quality instruction-tuning data focused on reasoning in the document domain.

Empirical Validation and Theoretical Implications

DocOwl 1.5's empirical results demonstrate strong capabilities in visual document understanding. It achieves state-of-the-art performance across 10 visual document understanding benchmarks, improving the previous best results of MLLMs with a 7B LLM by more than 10 points on 5 of the 10 benchmarks, and thereby highlights the efficacy of Unified Structure Learning in parsing and understanding diverse document types without OCR dependency.

This research has both practical and theoretical implications, paving the way for OCR-free MLLM applications across various domains. It also opens avenues for exploring multimodal learning strategies that further narrow the gap between human-level and machine understanding of complex visual documents.

Conclusion

In summary, this work's innovative approach to Unified Structure Learning, coupled with the introduction of H-Reducer and the meticulous assembly of specialized datasets, propels DocOwl 1.5 to the forefront of OCR-free visual document understanding. It signifies a substantial advancement in the field, offering a robust foundation for future explorations aimed at further unraveling the intricacies of multimodal understanding in text-rich image contexts.
