InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (2401.13313v1)
Abstract: We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, covering a wide range of 12 tasks and open document types and formats. Furthermore, to enhance generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, which connects document images, image encoders, and LLMs through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions, outperforming existing multimodal LLMs and ChatGPT without task-specific training.
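The abstract describes an architecture in which only a bridging module between a frozen image encoder and a frozen LLM is trained. Below is a minimal sketch of what such a module could look like, assuming a BLIP-2/Q-Former-style design where learnable query tokens cross-attend to image features and are projected into the LLM's embedding space. The class name `BridgingModule`, all dimensions, and the use of `nn.TransformerDecoder` are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of a trainable bridge between a frozen image encoder and a frozen LLM.
# Assumes a Q-Former-style design; details are hypothetical, not InstructDr's.
import torch
import torch.nn as nn

class BridgingModule(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, hidden_dim=768,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that "read" the document image features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        # Project frozen image-encoder features into the bridge's hidden size.
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        # Cross-attention stack: queries attend to projected image features.
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.cross_attn = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Map the attended queries into the LLM's embedding space.
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen encoder.
        memory = self.vision_proj(image_feats)
        queries = self.queries.expand(image_feats.size(0), -1, -1)
        attended = self.cross_attn(tgt=queries, memory=memory)
        # Output (batch, num_queries, llm_dim): soft visual "tokens" that would
        # be prepended to the instruction's token embeddings for the frozen LLM.
        return self.llm_proj(attended)

# Usage: only this module receives gradients; encoder and LLM stay frozen.
bridge = BridgingModule()
fake_feats = torch.randn(2, 257, 1024)   # e.g., ViT patch features (assumed shape)
visual_tokens = bridge(fake_feats)       # -> torch.Size([2, 32, 4096])
print(visual_tokens.shape)
```

Keeping the encoder and LLM frozen and training only a small bridge like this is what lets the model adapt to new tasks via instructions alone, since the language model's general instruction-following ability is preserved.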