mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (2307.02499v1)
Abstract: Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
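The unified instruction tuning strategy described in the abstract, converting language-only, general vision-and-language, and document data into one shared format and sampling from them jointly, can be illustrated with a minimal sketch. The `Human: ... AI: ...` template follows mPLUG-Owl's conversation style; the `InstructionSample` type, the `to_prompt` and `mix_datasets` helpers, and the sampling weights are hypothetical illustrations under stated assumptions, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionSample:
    """One training example after conversion to the shared format."""
    instruction: str             # the question or task prompt
    response: str                # the target answer
    image: Optional[str] = None  # image path; None for language-only data

def to_prompt(sample: InstructionSample) -> str:
    """Render a sample with the unified 'Human: ... AI: ...' template
    (mPLUG-Owl's conversation style; assumed here for DocOwl as well)."""
    image_tag = "<image>\n" if sample.image is not None else ""
    return f"{image_tag}Human: {sample.instruction}\nAI: {sample.response}"

def mix_datasets(language_only, general_vl, document, weights=(1.0, 1.0, 1.0)):
    """Yield samples drawn jointly from the three instruction sources.
    Weighted random mixing is one simple realization of joint training;
    the paper's actual sampling ratios are not specified here."""
    pools = [language_only, general_vl, document]
    while True:
        pool = random.choices(pools, weights=weights, k=1)[0]
        yield random.choice(pool)

# Usage: draw a small mixed batch of rendered prompts (toy data).
text_data = [InstructionSample("Summarize: ...", "...")]
vl_data = [InstructionSample("Describe the image.", "A dog on a beach.", "dog.jpg")]
doc_data = [InstructionSample("What is the invoice total?", "$1,280", "invoice.png")]
batch = [to_prompt(s) for _, s in zip(range(4), mix_datasets(text_data, vl_data, doc_data))]
```

Converting every source into one template means the model sees a single task distribution during joint training, which is what lets the document skills coexist with the general vision-and-language and language-only abilities.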
- VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- DUE: End-to-end document understanding benchmark. In NeurIPS Datasets and Benchmarks, 2021.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
- TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020.
- End-to-end document recognition and understanding with Dessurt. In ECCV Workshops (4), volume 13804 of Lecture Notes in Computer Science, pages 280–296. Springer, 2022.
- Question-controlled text-aware image captioning. In ACM Multimedia, pages 3097–3105. ACM, 2021.
- LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- LayoutLMv3: Pre-training for document AI with unified text and image masking. In ACM Multimedia, pages 4083–4091. ACM, 2022.
- OCR-free document understanding transformer. In ECCV (28), volume 13688 of Lecture Notes in Computer Science, pages 498–517. Springer, 2022.
- Pix2Struct: Screenshot parsing as pretraining for visual language understanding. CoRR, abs/2210.03347, 2022.
- mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. In EMNLP, pages 7241–7259. Association for Computational Linguistics, 2022.
- Visual instruction tuning. CoRR, abs/2304.08485, 2023a.
- On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895, 2023b.
- ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In ACL (Findings), pages 2263–2279. Association for Computational Linguistics, 2022.
- DocVQA: A dataset for VQA on document images. In WACV, pages 2199–2208. IEEE, 2021.
- InfographicVQA. In WACV, pages 2582–2591. IEEE, 2022.
- OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.
- P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In ACL (1), pages 1470–1480. The Association for Computer Linguistics, 2015.
- BLOOM: A 176B-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022.
- TextCaps: A dataset for image captioning with reading comprehension. In ECCV (2), volume 12347 of Lecture Notes in Computer Science, pages 742–758. Springer, 2020.
- Towards VQA models that can read. In CVPR, pages 8317–8326. Computer Vision Foundation / IEEE, 2019.
- Kleister: Key information extraction datasets involving long documents with complex layouts. In ICDAR (1), volume 12821 of Lecture Notes in Computer Science, pages 564–579. Springer, 2021.
- S. Svetlichnaya. DeepForm: Understand structured documents at scale, 2020.
- VisualMRC: Machine reading comprehension on document images. In AAAI, pages 13878–13888. AAAI Press, 2021.
- Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- Vicuna: An open chatbot impressing GPT-4. https://github.com/lm-sys/FastChat, 2023.
- Self-Instruct: Aligning language models with self-generated instructions. CoRR, abs/2212.10560, 2022. doi: 10.48550/arXiv.2212.10560. URL https://doi.org/10.48550/arXiv.2212.10560.
- Visual ChatGPT: Talking, drawing and editing with visual foundation models. CoRR, abs/2303.04671, 2023.
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. CoRR, abs/2304.01196, 2023a.
- mPLUG-2: A modularized multi-modal foundation model across text, image and video. CoRR, abs/2302.00402, 2023b.
- LayoutLM: Pre-training of text and layout for document image understanding. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1192–1200. ACM, 2020. doi: 10.1145/3394486.3403172. URL https://doi.org/10.1145/3394486.3403172.
- TAP: Text-aware pre-training for Text-VQA and Text-Caption. In CVPR, pages 8751–8761. Computer Vision Foundation / IEEE, 2021.
- MM-REACT: Prompting ChatGPT for multimodal reasoning and action. CoRR, abs/2303.11381, 2023.
- mPLUG-Owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.