Emergent Mind

Abstract

Multimodal LLMs (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory usage and slower inference, particularly in multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. Building on this compression module, and to strengthen multi-page document comprehension while balancing token efficiency and question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared with single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.

State-of-the-art multi-page document understanding, faster inference, lower GPU memory usage, detailed explanations, structure parsing.

Overview

  • Introduction of mPLUG-DocOwl2: a novel architecture for improving efficiency and performance in multi-page document understanding without relying on Optical Character Recognition (OCR).

  • Technical innovations: the High-resolution DocCompressor, Shape-adaptive Cropping Module, Vision-to-Text Module (H-Reducer), and Cross-Attention Based Compression, which together enable efficient token usage and strong document comprehension.

  • Performance and implications: state-of-the-art results on several benchmarks with significantly reduced computational resources, carrying practical implications for real-world applications.

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

The paper "mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding" addresses critical challenges in the domain of Optical Character Recognition-free (OCR-free) document analysis using Multimodal LLMs (MLLMs). It introduces mPLUG-DocOwl2, a novel architecture designed to enhance the efficiency and performance of multi-page document understanding by employing a sophisticated high-resolution compression method.

Overview of Contributions

The authors present several key innovations and findings:

  1. High-resolution DocCompressor: The core contribution of this work is the High-resolution DocCompressor. Unlike traditional models that use numerous visual tokens for a high-resolution document, this scheme compresses image information significantly. Specifically, it converts high-resolution document images into only 324 tokens guided by low-resolution visual features. This method ensures efficient token usage without a substantial loss of textual and layout information.

  2. Three-Stage Training: The model leverages a three-phase training strategy: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. This approach enhances the model's ability to handle both single-page and multi-page document comprehension tasks comprehensively.

  3. Performance and Efficiency: mPLUG-DocOwl2 is demonstrated to significantly reduce GPU memory consumption and inference time while matching or surpassing other state-of-the-art models. Notably, it cuts first-token latency by over 50% and achieves comparable single-page understanding with less than 20% of the visual tokens used by single-image MLLMs trained on similar data.
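The token-budget claim above is easy to make concrete. The sketch below contrasts DocOwl2's fixed 324 visual tokens per page against a hypothetical uncompressed high-resolution encoder; the 2,560-token baseline is an illustrative figure of our choosing, not a number from the paper.

```python
def visual_tokens(pages: int, per_page: int) -> int:
    """Total visual tokens an MLLM must process for a multi-page document."""
    return pages * per_page

# Hypothetical 10-page document.
baseline = visual_tokens(10, 2560)  # illustrative uncompressed high-res encoder
docowl2 = visual_tokens(10, 324)    # DocOwl2's fixed per-page budget

# The ratio stays well under the 20% figure quoted above.
print(docowl2, baseline, docowl2 / baseline)
```

Because the per-page budget is constant, the savings compound linearly with page count, which is why the efficiency gains are most visible in multi-page comprehension.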

Technical Approach

The technical progress made by mPLUG-DocOwl2 includes several noteworthy components:

Shape-adaptive Cropping Module:

This preprocessing step partitions high-resolution images into manageable sub-images while preserving global context through low-resolution images. This ensures that the overall document layout is maintained without overwhelming the model with excessive tokens.
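A minimal sketch of this cropping idea, using numpy only: pick a grid whose aspect ratio best matches the input image, split the resized image into fixed-size sub-images, and keep a low-resolution global view for layout context. The candidate-grid search, the 448-pixel cell size, and the nearest-neighbor resize are our own simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def shape_adaptive_crop(img, cell=448, max_cells=9):
    """Split a high-resolution image into sub-images plus a low-res global view.

    Chooses the grid (rows, cols) with rows*cols <= max_cells whose aspect
    ratio best matches the image (a stand-in for the paper's candidate set),
    then nearest-neighbor resizes and tiles the image.
    """
    h, w = img.shape[:2]
    rows, cols = min(
        ((r, c) for r in range(1, max_cells + 1)
                for c in range(1, max_cells + 1) if r * c <= max_cells),
        key=lambda rc: abs((w / h) - (rc[1] / rc[0])),
    )
    # Nearest-neighbor resize to the grid's pixel size via index maps.
    ys = (np.arange(rows * cell) * h // (rows * cell)).clip(0, h - 1)
    xs = (np.arange(cols * cell) * w // (cols * cell)).clip(0, w - 1)
    resized = img[ys][:, xs]
    subs = [resized[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            for r in range(rows) for c in range(cols)]
    # A single low-resolution view preserves the overall document layout.
    g_ys = (np.arange(cell) * h // cell).clip(0, h - 1)
    g_xs = (np.arange(cell) * w // cell).clip(0, w - 1)
    return subs, img[g_ys][:, g_xs]
```

For a wide 1000x2000 page this picks a 1x2 grid, yielding two 448x448 sub-images plus one 448x448 global view.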

Vision-to-Text Module (H-Reducer):

The H-Reducer module aligns visual features with the textual feature space of LLMs after a convolutional operation that condenses horizontal features. This alignment facilitates more efficient cross-modal learning.
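The horizontal condensing step can be sketched as follows: merge every `ratio` horizontally adjacent features, then project into the LLM's hidden size. The paper describes a learned 1 x ratio convolution; mean-pooling and a random projection stand in for learned weights here, and all dimensions are illustrative.

```python
import numpy as np

def h_reducer(feat, ratio=4, out_dim=4096, seed=0):
    """Condense horizontal visual features, then map them to the LLM space.

    feat: (h, w, d) grid of visual features with w divisible by `ratio`.
    Returns (h * w // ratio, out_dim) token embeddings.
    """
    h, w, d = feat.shape
    assert w % ratio == 0, "feature width must be divisible by the merge ratio"
    # Mean-pool groups of `ratio` horizontal neighbors (stand-in for the
    # learned 1 x ratio convolution).
    merged = feat.reshape(h, w // ratio, ratio, d).mean(axis=2)
    # Random projection as a stand-in for the learned vision-to-text linear layer.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((d, out_dim)) / np.sqrt(d)
    return merged.reshape(-1, d) @ proj
```

Condensing horizontally (rather than in both directions) suits documents, where text lines are horizontal and adjacent patches on a line are highly redundant.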

Cross-Attention Based Compression:

The High-resolution DocCompressor utilizes a cross-attention mechanism, where global visual features act as queries to compress aligned high-resolution features effectively. This allows the model to maintain essential visual-situated text information within a drastically reduced token count.
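The key property of this design is that the output length equals the number of queries, so however many high-resolution features a page produces, the compressed result has a fixed size (324 tokens in DocOwl2). A single-head, projection-free sketch under those assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress(global_feats, highres_feats):
    """Cross-attention compression sketch.

    global_feats:  (q, d) low-resolution global features used as queries
                   (q = 324 in DocOwl2).
    highres_feats: (n, d) aligned high-resolution features used as keys/values,
                   where n may be in the thousands.
    Returns (q, d): one compressed token per query, regardless of n.
    Single-head and without learned projections -- illustrative only.
    """
    d = global_feats.shape[-1]
    attn = softmax(global_feats @ highres_feats.T / np.sqrt(d))
    return attn @ highres_feats
```

Because each global feature attends over the full high-resolution grid, text detail can be folded into the compressed tokens while the layout-aware queries decide what to keep.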

Results

mPLUG-DocOwl2 sets a new benchmark in multi-page document understanding by achieving state-of-the-art performance on relevant datasets with significantly fewer visual tokens and reduced computational demands. It is particularly notable for:

Single-image Performance:

Achieving superior results on single-page document benchmarks like DocVQA, InfoVQA, and others, with a considerable reduction in visual token count.

Multi-page Document Understanding:

Outperforming existing models on multi-page datasets such as MP-DocVQA and DUDE, showcasing its advanced comprehension capabilities across document structures.

Latency and Resource Utilization:

Demonstrating a dramatic decrease in first token latency and GPU memory usage, which enables more scalable and practical application scenarios in real-world document processing.

Implications and Future Developments

The implications of this research are significant for both theoretical advancements and practical applications:

Theoretical:

The introduction of a high-resolution compression mechanism within an MLLM framework pushes the boundaries of how we understand and utilize visual tokens for compressing complex document layouts. This opens avenues for further research in optimizing multimodal learning with minimal token use.

Practical:

The reduced computational footprint and faster inference times make mPLUG-DocOwl2 suitable for deployment in resource-constrained environments, such as mobile devices or real-time applications, where quick and efficient document understanding is critical.

Conclusion

mPLUG-DocOwl2 represents a substantial step forward in the field of OCR-free document understanding. By balancing token efficiency, performance, and computational resource usage, it paves the way for more practical and scalable solutions in multimodal AI. Future developments could focus on further narrowing the gap between token utilization and understanding accuracy, enhancing the model's versatility across diverse document types and real-world scenarios. This work underscores the importance of efficient representation in high-resolution image processing and sets a new standard for future research in the domain.
