TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (2404.09204v1)
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited to document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, an MLLM specifically designed for document-oriented tasks while preserving the general capabilities of MLLMs. TextHawk explores efficient fine-grained perception through four dedicated components. First, a ReSampling and ReArrangement (ReSA) module reduces the redundancy in document texts and lowers the computational cost of the MLLM. Second, Scalable Positional Embeddings (SPEs) encode the position of each local feature while remaining scalable to varying image sizes. Third, a Query Proposal Network (QPN) dynamically initializes the queries for different sub-images. Fourth, to further enhance the fine-grained visual perception of the MLLM, we design a Multi-Level Cross-Attention (MLCA) mechanism that captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks and show that TextHawk outperforms state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
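To make the resampling idea concrete, the sketch below shows a minimal cross-attention resampler in which a small set of learnable queries pools a long sequence of visual tokens into a fixed, much shorter sequence before it reaches the language model. This is only an illustrative sketch of the general technique, not the paper's ReSA implementation: the class name `TokenResampler`, the dimensions, and the query count are assumptions chosen for clarity.

```python
# Minimal sketch (not the authors' code): cross-attention token resampling,
# the general mechanism behind compressing visual features for an MLLM.
import torch
import torch.nn as nn


class TokenResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable queries that pool information from the visual features.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) from a vision encoder.
        batch = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # A few queries attend to many visual tokens, so the downstream
        # language model only sees num_queries tokens per (sub-)image.
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.norm(pooled)


if __name__ == "__main__":
    resampler = TokenResampler()
    feats = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid per image
    print(resampler(feats).shape)       # torch.Size([2, 64, 1024])
```

In this toy setting, 576 patch tokens per image are compressed to 64 tokens; TextHawk's actual ReSA module additionally rearranges the resampled features and is combined with SPE, QPN, and MLCA as described in the abstract.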