Abstract

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, covering a wide range of 12 tasks and open document types/formats. Furthermore, to enhance generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and LLMs through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.

Figure: Datasets in InstructDoc, showcasing extensive coverage of VDU tasks, document types, and formats.

Overview

  • The paper addresses the challenge of building models for Visual Document Understanding (VDU) that generalize over diverse tasks, by introducing a dataset and model leveraging human-written instructions.

  • InstructDoc is a new dataset that promotes zero-shot generalization in VDU tasks, featuring 12 tasks from 30 datasets unified under an instructional schema.

  • The introduced model, InstructDr, incorporates a Document-former module to effectively process document images with LLMs for improved zero-shot task performance.

  • InstructDr achieves superior zero-shot performance compared with existing multimodal LLMs and outperforms ChatGPT on VDU tasks, highlighting the importance of instructions for model generalization.

  • There are limitations in the current approach, including reliance on OCR quality and the inability to model correlations among multiple document-text pairs; automating instruction generation is left as potential future work.

Introduction

The burgeoning field of Visual Document Understanding (VDU) calls for robust models capable of handling a diversity of document-related tasks. As such, recent research has concentrated on improving models' abilities to interpret the intricate relationship between textual and visual objects within documents. Despite this focus, creating a universal model that effectively transfers knowledge across various document types, formats, and tasks remains a significant challenge. In particular, most visual instruction tuning datasets and models have been limited, focusing primarily on scene images or lacking the ability to adapt to a wide array of VDU tasks. Aiming to bridge this gap, this work merges human-written instructions with visual documents to drive model generalization across unseen VDU tasks.

InstructDoc Dataset and Model Advancement

The paper introduces InstructDoc, a pioneering dataset designed to foster zero-shot generalization in VDU tasks through the use of instructions. InstructDoc encompasses 12 tasks drawn from 30 diverse datasets, all formulated within a uniform instruction schema. This schema demands a complex set of competencies from models, such as grasping document layouts and interpreting visual representations of texts and objects. Building on this dataset, the authors develop a new model, InstructDr, which integrates document images, image encoders, and LLMs via a trainable bridging module called the Document-former. This module transforms documents into representations digestible by LLMs, enhancing zero-shot performance across VDU tasks when models are supplied with instructions.
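As a rough illustration of how such a unified schema might look, the sketch below shows one instruction-formatted VDU example as a Python dictionary. The field names, file path, and instruction wording are hypothetical placeholders, not the actual InstructDoc format.

```python
# Hypothetical sketch of a single instruction-formatted VDU example.
# Field names, the file path, and the instruction wording are illustrative
# assumptions, not the actual InstructDoc schema.
example = {
    "source_dataset": "DocVQA",                    # an example document-QA dataset
    "task": "document question answering",         # one task category out of many
    "instruction": (
        "You are given a document image and its OCR text. "
        "Answer the question using only information found in the document."
    ),
    "document_image": "images/report_page_1.png",  # illustrative path
    "ocr_text": "Quarterly Report ... Total revenue: 4.2M ...",
    "question": "What is the total revenue reported?",
    "answer": "4.2M",
}
```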

Architectural Innovations and Empirical Evaluations

InstructDr, through its Document-former, is adept at mapping visual and textual document features into a space interpretable by an LLM. Experimental results reveal that InstructDr significantly surpasses the zero-shot performance of current multimodal LLMs and outperforms ChatGPT in numerous VDU tasks when aided by instructions. Such outcomes underscore the efficacy of instructions in improving model generalization and robustness. The model's architecture also supports multi-page document comprehension by encoding multiple document images in parallel, thereby enabling intricate reasoning across pages.
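For concreteness, the following is a minimal sketch of what such a bridging module could look like in PyTorch, assuming a Q-Former-style design: learnable query tokens cross-attend to frozen image-encoder features, pages are encoded in parallel by folding the page dimension into the batch dimension, and the resulting tokens are projected into the LLM's embedding space. Dimensions, layer counts, and the choice of nn.TransformerDecoder are assumptions, and the sketch omits the textual (OCR) inputs described above; it is not the paper's exact Document-former.

```python
import torch
import torch.nn as nn

class BridgingModule(nn.Module):
    """Sketch of a Document-former-style bridge (an assumption, not the paper's
    exact architecture): learnable query tokens cross-attend to frozen
    image-encoder features, and the outputs are projected into the LLM's
    embedding space. Multiple pages are processed in parallel by folding the
    page dimension into the batch dimension."""

    def __init__(self, vision_dim=1024, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=4, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.cross_attn = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, page_features):
        # page_features: (batch, pages, patches, vision_dim) from a frozen image encoder
        b, p, n, d = page_features.shape
        feats = self.vision_proj(page_features.view(b * p, n, d))
        queries = self.queries.unsqueeze(0).expand(b * p, -1, -1)
        # Queries attend to each page's visual features independently (parallel pages).
        out = self.cross_attn(tgt=queries, memory=feats)   # (b*p, num_queries, hidden_dim)
        out = self.llm_proj(out)                           # project into the LLM embedding space
        return out.view(b, p * out.size(1), -1)            # concatenate page tokens per document


# Usage sketch: 2 documents, 3 pages each, 196 patches per page.
bridge = BridgingModule()
features = torch.randn(2, 3, 196, 1024)
llm_inputs = bridge(features)   # (2, 3 * 32, 4096) soft tokens prepended to the LLM input
```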

Critical Reflections and Future Prospects

Despite the merits of InstructDr, the research recognizes limitations, including a dependency on OCR quality and the inability to account for correlations among multiple document-text pairs. Furthermore, the possibility of enriching the dataset with automated instruction generation and augmentation remains unexplored. In summary, the advent of InstructDoc and the development of InstructDr mark a significant stride toward general-purpose VDU models that comprehend and execute tasks guided by natural language instructions. This research constitutes a valuable contribution to the evolution of document AI, arguably setting a benchmark for ensuing work in the discipline.
