
Abstract

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e., page numbers). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts, which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc

MMLongBench-Doc evaluates LVLMs on lengthy, multi-modal documents; results show many LVLMs struggle.

Overview

  • MMLongBench-Doc is a new benchmark designed to evaluate the capabilities of Large Vision-Language Models (LVLMs) for understanding long-form, multi-modal documents.

  • The dataset includes 130 diverse PDF documents and 1,062 expert-annotated questions, with significant portions requiring cross-page comprehension and hallucination detection.

  • Experimental results highlight substantial challenges for current state-of-the-art models, especially in handling multi-modal, multi-page documents, with key difficulties in perceptual accuracy and hallucination mitigation.

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

MMLongBench-Doc is a newly introduced benchmark aimed at evaluating the capabilities of Large Vision-Language Models (LVLMs) in the context of long-form, multi-modal document understanding. The benchmark addresses significant gaps in previous datasets, which predominantly focused on short, single-page documents, thereby limiting the scope of their evaluations.

Key Contributions

Dataset Construction:

  • The dataset comprises 130 PDF-formatted documents with an average of 49.4 pages and 20,970.9 textual tokens. These documents come from diverse sources such as research reports, financial reports, academic papers, brochures, and guidelines, ensuring well-rounded coverage for evaluation.
  • A distinctive feature is the inclusion of 1,062 expert-annotated questions that require evidence not only from textual content but also from images, charts, tables, and layout structures. Furthermore, 33.2% of these questions are cross-page questions, necessitating comprehension across multiple pages, and 22.8% are designed to be unanswerable, testing the models' hallucination detection capabilities (an illustrative annotation record is sketched below).
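To make the annotation structure concrete, the sketch below models one question record in Python. The field names (doc_id, evidence_pages, evidence_sources, answer_format) are illustrative assumptions for exposition, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DocQuestion:
    """One annotated question over a long PDF document (illustrative schema, not the official one)."""
    doc_id: str                        # which of the 130 source PDFs
    question: str
    answer: Optional[str]              # None for deliberately unanswerable questions
    answer_format: str                 # e.g. "string", "integer", "float", "list"
    evidence_pages: List[int] = field(default_factory=list)    # more than one page => cross-page question
    evidence_sources: List[str] = field(default_factory=list)  # e.g. ["text", "chart", "table", "layout"]

    @property
    def is_cross_page(self) -> bool:
        return len(self.evidence_pages) > 1

    @property
    def is_unanswerable(self) -> bool:
        return self.answer is None
```

Under this representation, the reported statistics (33.2% cross-page, 22.8% unanswerable) are simply the fractions of records for which is_cross_page and is_unanswerable hold.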

Evaluation Metrics:

  • The benchmark uses a combination of generalized accuracy and F1 score to provide a nuanced evaluation across different question types and evidence sources.
  • The benchmark methodology comprises a three-step evaluation protocol: response generation, answer extraction, and score calculation. This pipeline achieves a high correlation between automatic evaluation and human judgment, using format-specific rules for answers expressed as strings, integers, floats, and lists (a minimal sketch of the score-calculation step follows this list).
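Below is a minimal sketch of the score-calculation step. It assumes format-specific matching (exact match for strings and integers, a small relative tolerance for floats, set comparison for lists) and treats unanswerable questions as the negative class when computing F1; the benchmark's exact rules and its generalized-accuracy aggregation may differ.

```python
from typing import List, Optional

def answer_correct(pred: str, gold: str, fmt: str, float_tol: float = 0.01) -> bool:
    """Format-specific matching; a simplified stand-in for the benchmark's rules."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if fmt == "float":
        try:
            return abs(float(pred) - float(gold)) <= float_tol * max(abs(float(gold)), 1.0)
        except ValueError:
            return False
    if fmt == "list":
        return {p.strip() for p in pred.split(",")} == {g.strip() for g in gold.split(",")}
    return pred == gold  # strings and integers: exact match after normalization

def f1_score(preds: List[str], golds: List[Optional[str]], fmts: List[str]) -> float:
    """Aggregate F1, treating unanswerable questions (gold is None) as the negative class (an assumption)."""
    tp = fp = fn = 0
    for pred, gold, fmt in zip(preds, golds, fmts):
        declined = pred.strip().lower() in {"", "not answerable", "unanswerable"}
        if gold is None:
            if not declined:
                fp += 1              # answered an unanswerable question (hallucination)
        elif declined:
            fn += 1                  # missed an answerable question
        elif answer_correct(pred, gold, fmt):
            tp += 1
        else:
            fp += 1                  # a wrong answer counts against both precision
            fn += 1                  # and recall in this simplified aggregation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```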

Comparison with Previous Datasets:

  • In contrast to previous datasets like DocVQA, ChartQA, and SlideVQA, which primarily focused on single-page documents or those of limited complexity, MMLongBench-Doc stands out by its complexity and diversity. Document lengths, page numbers, and token densities in MMLongBench-Doc are significantly higher, thereby pushing the boundaries of existing LVLM capabilities.

Experimental Results

The paper conducts extensive experiments evaluating 14 LVLMs and 10 LLMs. The results compellingly reveal that long-context document understanding presents substantial challenges to current state-of-the-art models:

Performance Indicators:

  • The best-performing model, GPT-4o, achieved an F1 score of only 42.7%. This performance starkly contrasts with traditional document understanding tasks where models often exceed 90% accuracy.
  • Interestingly, many LVLMs perform worse than their LLM counterparts, which receive only lossy OCR-parsed text, underlining the difficulty of processing multi-modal, multi-page documents effectively (the two input conditions are sketched below).
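To make that comparison concrete, the sketch below renders PDF pages to images for the LVLM branch and keeps only the parsed text layer for the LLM branch. The PyMuPDF calls are real, but the pipeline is an illustration of the setup, not the paper's exact preprocessing.

```python
import os
import fitz  # PyMuPDF

def pdf_to_lvlm_and_llm_inputs(pdf_path: str, image_dir: str = "pages"):
    """Render each page to a PNG (LVLM input) and extract its text layer (LLM input)."""
    os.makedirs(image_dir, exist_ok=True)
    image_paths, page_texts = [], []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=144)               # rasterized page for the LVLM
            out_path = os.path.join(image_dir, f"page_{i:03d}.png")
            pix.save(out_path)
            image_paths.append(out_path)
            page_texts.append(page.get_text())           # text-only, lossy view for the LLM
    return image_paths, "\n\n".join(page_texts)
```

Anything conveyed only visually, such as charts, figures, and layout, disappears in the text branch; this is the lossy parsing that the LLM baselines above are fed.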

Error Analysis:

  • Most errors were attributed to hallucinated evidence, perceptual inaccuracies, and difficulty gathering complete evidence for cross-page questions. For instance, GPT-4o frequently attempted to provide answers to questions that were designed to be unanswerable, leading to a higher hallucination rate.

Implications and Future Directions

The pronounced challenges highlighted by MMLongBench-Doc underscore the necessity for more robust LVLM architectures capable of long-context comprehension. The benchmark reveals specific areas where existing models falter, such as:

  • Perceptual Capability: Enhancing the visual perception of models to accurately interpret images, charts, and complex layouts.
  • Cross-page Comprehension: Developing mechanisms for effective global searching and information aggregation across multiple document pages.
  • Hallucination Mitigation: Improving the models' ability to recognize when a question is unanswerable to reduce false positives.

Moving forward, the dataset could be instrumental in guiding the next generation of research in multi-modal long-context document understanding. Enhancing the pre-training corpus with more diverse and complex long-form documents, along with fine-tuning strategies, may bridge the performance gap identified in this study. Furthermore, practical applications of these advancements could span various domains including legal document analysis, scientific literature reviews, and large-scale financial report audits.

Conclusion

MMLongBench-Doc represents a significant advancement in evaluating the document understanding capabilities of LVLMs, particularly in the context of long, multi-modal documents. By identifying explicit challenges and providing a robust benchmark, this work paves the way for future developments that could significantly enhance the capabilities of LVLMs in practical, real-world applications.
