mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (2307.02499v1)
Abstract: Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, built on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Moreover, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.
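The abstract describes a unified instruction tuning strategy that jointly trains on language-only, general vision-and-language, and document instruction data. As a rough illustration only (the paper's actual data format, dataset names, field names, and sampling ratios are not given here), the sketch below shows one plausible way such a mixture could be normalized into a single instruction/response format and sampled jointly; everything in it is an assumption, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): normalize language-only,
# general vision-and-language, and document instruction samples into
# one shared format, then draw mixed training batches from the pool.
# All dataset contents, templates, and mixing weights are illustrative.

import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionSample:
    instruction: str            # the instruction or question posed to the model
    response: str               # the target answer used as the training label
    image_path: Optional[str]   # None for language-only samples

def to_prompt(sample: InstructionSample) -> str:
    """Render one sample into a single training string (hypothetical template)."""
    image_tag = "<image>\n" if sample.image_path else ""
    return f"{image_tag}Human: {sample.instruction}\nAI: {sample.response}"

# Illustrative pools standing in for the three instruction-tuning sources.
language_only = [
    InstructionSample("Summarize: instruction tuning aligns LLMs with user intent.",
                      "Instruction tuning teaches LLMs to follow user instructions.", None),
]
general_vl = [
    InstructionSample("What animal is in the picture?", "A cat.", "coco/000001.jpg"),
]
document = [
    InstructionSample("What is the total amount on the receipt?", "$42.10",
                      "docvqa/receipt_17.png"),
]

def sample_batch(batch_size: int = 4, weights=(0.3, 0.3, 0.4)):
    """Draw a mixed batch; the ratio across the three sources is an assumption."""
    pools = [language_only, general_vl, document]
    batch = []
    for _ in range(batch_size):
        pool = random.choices(pools, weights=weights, k=1)[0]
        batch.append(to_prompt(random.choice(pool)))
    return batch

if __name__ == "__main__":
    for prompt in sample_batch():
        print(prompt, "\n---")
```

The point of the shared template is that document-oriented samples can be trained alongside general and text-only instruction data without any task-specific heads or formats; the mixing weights would in practice be tuned to balance the sources.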
- VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- DUE: End-to-end document understanding benchmark. In NeurIPS Datasets and Benchmarks, 2021.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
- TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020.
- End-to-end document recognition and understanding with Dessurt. In ECCV Workshops (4), volume 13804 of Lecture Notes in Computer Science, pages 280–296. Springer, 2022.
- Question-controlled text-aware image captioning. In ACM Multimedia, pages 3097–3105. ACM, 2021.
- LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- LayoutLMv3: Pre-training for document AI with unified text and image masking. In ACM Multimedia, pages 4083–4091. ACM, 2022.
- OCR-free document understanding transformer. In ECCV (28), volume 13688 of Lecture Notes in Computer Science, pages 498–517. Springer, 2022.
- Pix2Struct: Screenshot parsing as pretraining for visual language understanding. CoRR, abs/2210.03347, 2022.
- mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. In EMNLP, pages 7241–7259. Association for Computational Linguistics, 2022.
- Visual instruction tuning. CoRR, abs/2304.08485, 2023a.
- On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895, 2023b.
- ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In ACL (Findings), pages 2263–2279. Association for Computational Linguistics, 2022.
- DocVQA: A dataset for VQA on document images. In WACV, pages 2199–2208. IEEE, 2021.
- InfographicVQA. In WACV, pages 2582–2591. IEEE, 2022.
- OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.
- P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In ACL (1), pages 1470–1480. The Association for Computer Linguistics, 2015.
- BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022.
- TextCaps: A dataset for image captioning with reading comprehension. In ECCV (2), volume 12347 of Lecture Notes in Computer Science, pages 742–758. Springer, 2020.
- Towards VQA models that can read. In CVPR, pages 8317–8326. Computer Vision Foundation / IEEE, 2019.
- Kleister: Key information extraction datasets involving long documents with complex layouts. In ICDAR (1), volume 12821 of Lecture Notes in Computer Science, pages 564–579. Springer, 2021.
- S. Svetlichnaya. DeepForm: Understand structured documents at scale, 2020.
- VisualMRC: Machine reading comprehension on document images. In AAAI, pages 13878–13888. AAAI Press, 2021.
- Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- Vicuna: An open chatbot impressing GPT-4. https://github.com/lm-sys/FastChat, 2023.
- Self-Instruct: Aligning language models with self-generated instructions. CoRR, abs/2212.10560, 2022. doi: 10.48550/arXiv.2212.10560. URL https://doi.org/10.48550/arXiv.2212.10560.
- Visual ChatGPT: Talking, drawing and editing with visual foundation models. CoRR, abs/2303.04671, 2023.
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. CoRR, abs/2304.01196, 2023a.
- mPLUG-2: A modularized multi-modal foundation model across text, image and video. CoRR, abs/2302.00402, 2023b.
- LayoutLM: Pre-training of text and layout for document image understanding. In R. Gupta, Y. Liu, J. Tang, and B. A. Prakash, editors, KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 1192–1200. ACM, 2020. doi: 10.1145/3394486.3403172. URL https://doi.org/10.1145/3394486.3403172.
- TAP: Text-aware pre-training for text-VQA and text-caption. In CVPR, pages 8751–8761. Computer Vision Foundation / IEEE, 2021.
- MM-REACT: Prompting ChatGPT for multimodal reasoning and action. CoRR, abs/2303.11381, 2023.
- mPLUG-Owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.