mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (2307.02499v1)
Abstract: Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Moreover, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
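The abstract describes a unified instruction tuning strategy that jointly trains on language-only, general vision-and-language, and document instruction data. As a rough illustration only (the paper's actual data format, dataset names, field names, and sampling ratios are not given here), the sketch below shows one plausible way such a mixture could be normalized into a single instruction/response format and sampled jointly; everything in it is an assumption, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): normalize language-only,
# general vision-and-language, and document instruction samples into
# one shared format, then draw mixed training batches from the pool.
# All dataset contents, templates, and mixing weights are illustrative.

import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionSample:
    instruction: str            # the instruction or question posed to the model
    response: str               # the target answer used as the training label
    image_path: Optional[str]   # None for language-only samples

def to_prompt(sample: InstructionSample) -> str:
    """Render one sample into a single training string (hypothetical template)."""
    image_tag = "<image>\n" if sample.image_path else ""
    return f"{image_tag}Human: {sample.instruction}\nAI: {sample.response}"

# Illustrative pools standing in for the three instruction-tuning sources.
language_only = [
    InstructionSample("Summarize: instruction tuning aligns LLMs with user intent.",
                      "Instruction tuning teaches LLMs to follow user instructions.", None),
]
general_vl = [
    InstructionSample("What animal is in the picture?", "A cat.", "coco/000001.jpg"),
]
document = [
    InstructionSample("What is the total amount on the receipt?", "$42.10",
                      "docvqa/receipt_17.png"),
]

def sample_batch(batch_size: int = 4, weights=(0.3, 0.3, 0.4)):
    """Draw a mixed batch; the ratio across the three sources is an assumption."""
    pools = [language_only, general_vl, document]
    batch = []
    for _ in range(batch_size):
        pool = random.choices(pools, weights=weights, k=1)[0]
        batch.append(to_prompt(random.choice(pool)))
    return batch

if __name__ == "__main__":
    for prompt in sample_batch():
        print(prompt, "\n---")
```

The point of the shared template is that document-oriented samples can be trained alongside general and text-only instruction data without any task-specific heads or formats; the mixing weights would in practice be tuned to balance the sources.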
- VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- DUE: End-to-end document understanding benchmark. In NeurIPS Datasets and Benchmarks, 2021.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
- TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020.
- End-to-end document recognition and understanding with Dessurt. In ECCV Workshops (4), volume 13804 of Lecture Notes in Computer Science, pages 280–296. Springer, 2022.
- Question-controlled text-aware image captioning. In ACM Multimedia, pages 3097–3105. ACM, 2021.
- LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- LayoutLMv3: Pre-training for document AI with unified text and image masking. In ACM Multimedia, pages 4083–4091. ACM, 2022.
- OCR-free document understanding transformer. In ECCV (28), volume 13688 of Lecture Notes in Computer Science, pages 498–517. Springer, 2022.
- Pix2Struct: Screenshot parsing as pretraining for visual language understanding. CoRR, abs/2210.03347, 2022.
- mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. In EMNLP, pages 7241–7259. Association for Computational Linguistics, 2022.
- Visual instruction tuning. CoRR, abs/2304.08485, 2023a.
- On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895, 2023b.
- ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In ACL (Findings), pages 2263–2279. Association for Computational Linguistics, 2022.
- DocVQA: A dataset for VQA on document images. In WACV, pages 2199–2208. IEEE, 2021.
- InfographicVQA. In WACV, pages 2582–2591. IEEE, 2022.
- OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.
- P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In ACL (1), pages 1470–1480. The Association for Computer Linguistics, 2015.
- BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022.
- TextCaps: A dataset for image captioning with reading comprehension. In ECCV (2), volume 12347 of Lecture Notes in Computer Science, pages 742–758. Springer, 2020.
- Towards VQA models that can read. In CVPR, pages 8317–8326. Computer Vision Foundation / IEEE, 2019.
- Kleister: Key information extraction datasets involving long documents with complex layouts. In ICDAR (1), volume 12821 of Lecture Notes in Computer Science, pages 564–579. Springer, 2021.
- S. Svetlichnaya. DeepForm: Understand structured documents at scale, 2020.
- VisualMRC: Machine reading comprehension on document images. In AAAI, pages 13878–13888. AAAI Press, 2021.
- Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- Vicuna: An open chatbot impressing GPT-4. https://github.com/lm-sys/FastChat, 2023.
- Self-Instruct: Aligning language models with self-generated instructions. CoRR, abs/2212.10560, 2022. doi: 10.48550/arXiv.2212.10560. URL https://doi.org/10.48550/arXiv.2212.10560.
- Visual ChatGPT: Talking, drawing and editing with visual foundation models. CoRR, abs/2303.04671, 2023.
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. CoRR, abs/2304.01196, 2023a.
- mPLUG-2: A modularized multi-modal foundation model across text, image and video. CoRR, abs/2302.00402, 2023b.
- LayoutLM: Pre-training of text and layout for document image understanding. In R. Gupta, Y. Liu, J. Tang, and B. A. Prakash, editors, KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 1192–1200. ACM, 2020. doi: 10.1145/3394486.3403172. URL https://doi.org/10.1145/3394486.3403172.
- TAP: Text-aware pre-training for text-VQA and text-caption. In CVPR, pages 8751–8761. Computer Vision Foundation / IEEE, 2021.
- MM-REACT: Prompting ChatGPT for multimodal reasoning and action. CoRR, abs/2303.11381, 2023.
- mPLUG-Owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.