TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (2404.09204v1)
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive results on various multimodal tasks. However, most existing MLLMs are not well suited to document-oriented tasks, which require fine-grained image perception and information compression. In this paper, we present TextHawk, an MLLM specifically designed for document-oriented tasks while preserving the general capabilities of MLLMs. TextHawk explores efficient fine-grained perception through four dedicated components. First, a ReSampling and ReArrangement (ReSA) module reduces the redundancy in document texts and lowers the computational cost of the MLLM. Second, Scalable Positional Embeddings (SPEs) encode the position of each local feature while remaining scalable to varying image sizes. Third, a Query Proposal Network (QPN) dynamically initializes the queries for different sub-images. Fourth, to further enhance the fine-grained visual perception of the MLLM, we design a Multi-Level Cross-Attention (MLCA) mechanism that captures the hierarchical structure and semantic relations of document images. Furthermore, we create a new instruction-tuning dataset for document-oriented tasks by enriching multimodal document data with Gemini Pro. We conduct extensive experiments on both general and document-oriented MLLM benchmarks and show that TextHawk outperforms state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
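To make the resampling idea concrete, the sketch below shows a minimal cross-attention resampler in which a small set of learnable queries pools a long sequence of visual tokens into a fixed, much shorter sequence before it reaches the language model. This is only an illustrative sketch of the general technique, not the paper's ReSA implementation: the class name `TokenResampler`, the dimensions, and the query count are assumptions chosen for clarity.

```python
# Minimal sketch (not the authors' code): cross-attention token resampling,
# the general mechanism behind compressing visual features for an MLLM.
import torch
import torch.nn as nn


class TokenResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable queries that pool information from the visual features.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) from a vision encoder.
        batch = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # A few queries attend to many visual tokens, so the downstream
        # language model only sees num_queries tokens per (sub-)image.
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.norm(pooled)


if __name__ == "__main__":
    resampler = TokenResampler()
    feats = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid per image
    print(resampler(feats).shape)       # torch.Size([2, 64, 1024])
```

In this toy setting, 576 patch tokens per image are compressed to 64 tokens; TextHawk's actual ReSA module additionally rearranges the resampled features and is combined with SPE, QPN, and MLCA as described in the abstract.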