Abstract

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, covering a wide range of 12 tasks and open document types/formats. Furthermore, to enhance generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and LLMs through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.

Figure: Datasets in InstructDoc, showcasing extensive coverage of VDU tasks, document types, and formats.

Overview

  • The paper addresses the challenge of building models for Visual Document Understanding (VDU) that generalize over diverse tasks, by introducing a dataset and model leveraging human-written instructions.

  • InstructDoc is a new dataset that promotes zero-shot generalization in VDU tasks, featuring 12 tasks from 30 datasets unified under an instructional schema.

  • The introduced model, InstructDr, incorporates a Document-former module to effectively process document images with LLMs for improved zero-shot task performance.

  • InstructDr achieves superior zero-shot performance compared with existing multimodal LLMs and outperforms ChatGPT on VDU tasks, highlighting the importance of instructions for model generalization.

  • There are limitations in the current approach, including reliance on OCR quality and the inability to model correlations among multiple document-text pairs; automating instruction generation is left as potential future work.

Introduction

The burgeoning field of Visual Document Understanding (VDU) calls for robust models capable of handling a diversity of document-related tasks. As such, recent research has concentrated on improving models' abilities to interpret the intricate relationship between textual and visual objects within documents. Despite this focus, creating a universal model that effectively transfers knowledge across various document types, formats, and tasks remains a significant challenge. In particular, most visual instruction tuning datasets and models have been limited, focusing primarily on scene images or lacking the ability to adapt to a wide array of VDU tasks. Aiming to bridge this gap, this work merges human-written instructions with visual documents to drive model generalization across unseen VDU tasks.

InstructDoc Dataset and Model Advancement

The paper introduces InstructDoc, a pioneering dataset designed to foster zero-shot generalization in VDU tasks through the use of instructions. InstructDoc encompasses 12 tasks drawn from 30 diverse datasets, all formulated within a uniform instruction schema. This schema demands a complex set of competencies from models, such as grasping document layouts and interpreting visual representations of texts and objects. Building on this dataset, the authors develop a new model, InstructDr, which integrates document images, image encoders, and LLMs via a trainable bridging module called the Document-former. This module transforms documents into representations digestible by LLMs, enhancing zero-shot performance across VDU tasks when models are supplied with instructions.
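As a rough illustration of how such a unified schema might look, the sketch below shows one instruction-formatted VDU example as a Python dictionary. The field names, file path, and instruction wording are hypothetical placeholders, not the actual InstructDoc format.

```python
# Hypothetical sketch of a single instruction-formatted VDU example.
# Field names, the file path, and the instruction wording are illustrative
# assumptions, not the actual InstructDoc schema.
example = {
    "source_dataset": "DocVQA",                    # an example document-QA dataset
    "task": "document question answering",         # one task category out of many
    "instruction": (
        "You are given a document image and its OCR text. "
        "Answer the question using only information found in the document."
    ),
    "document_image": "images/report_page_1.png",  # illustrative path
    "ocr_text": "Quarterly Report ... Total revenue: 4.2M ...",
    "question": "What is the total revenue reported?",
    "answer": "4.2M",
}
```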

Architectural Innovations and Empirical Evaluations

InstructDr, through its Document-former, is adept at mapping visual and textual document features into a space interpretable by an LLM. Experimental results reveal that InstructDr significantly surpasses the zero-shot performance of current multimodal LLMs and outperforms ChatGPT in numerous VDU tasks when aided by instructions. Such outcomes underscore the efficacy of instructions in improving model generalization and robustness. The model's architecture also supports multi-page document comprehension by encoding multiple document images in parallel, thereby enabling intricate reasoning across pages.
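For concreteness, the following is a minimal sketch of what such a bridging module could look like in PyTorch, assuming a Q-Former-style design: learnable query tokens cross-attend to frozen image-encoder features, pages are encoded in parallel by folding the page dimension into the batch dimension, and the resulting tokens are projected into the LLM's embedding space. Dimensions, layer counts, and the choice of nn.TransformerDecoder are assumptions, and the sketch omits the textual (OCR) inputs described above; it is not the paper's exact Document-former.

```python
import torch
import torch.nn as nn

class BridgingModule(nn.Module):
    """Sketch of a Document-former-style bridge (an assumption, not the paper's
    exact architecture): learnable query tokens cross-attend to frozen
    image-encoder features, and the outputs are projected into the LLM's
    embedding space. Multiple pages are processed in parallel by folding the
    page dimension into the batch dimension."""

    def __init__(self, vision_dim=1024, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=4, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.cross_attn = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, page_features):
        # page_features: (batch, pages, patches, vision_dim) from a frozen image encoder
        b, p, n, d = page_features.shape
        feats = self.vision_proj(page_features.view(b * p, n, d))
        queries = self.queries.unsqueeze(0).expand(b * p, -1, -1)
        # Queries attend to each page's visual features independently (parallel pages).
        out = self.cross_attn(tgt=queries, memory=feats)   # (b*p, num_queries, hidden_dim)
        out = self.llm_proj(out)                           # project into the LLM embedding space
        return out.view(b, p * out.size(1), -1)            # concatenate page tokens per document


# Usage sketch: 2 documents, 3 pages each, 196 patches per page.
bridge = BridgingModule()
features = torch.randn(2, 3, 196, 1024)
llm_inputs = bridge(features)   # (2, 3 * 32, 4096) soft tokens prepended to the LLM input
```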

Critical Reflections and Future Prospects

Despite the merits of InstructDr, the research recognizes limitations, including a dependency on OCR quality and the inability to account for correlations among multiple document-text pairs. Furthermore, the possibility of enriching the dataset with automated instruction generation and augmentation remains unexplored. In summary, the advent of InstructDoc and the development of InstructDr mark a significant stride toward general-purpose VDU models that comprehend and execute tasks guided by natural language instructions. This research constitutes a valuable contribution to the evolution of document AI, arguably setting a benchmark for ensuing work in the discipline.
