
Abstract

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e., page numbers). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts, which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc

MMLongBench-Doc evaluates LVLMs on lengthy, multi-modal documents; results show many LVLMs struggle.

Overview

  • MMLongBench-Doc is a new benchmark designed to evaluate the capabilities of Large Vision-Language Models (LVLMs) for understanding long-form, multi-modal documents.

  • The dataset includes 130 diverse PDF documents and 1,062 expert-annotated questions, with significant portions requiring cross-page comprehension and hallucination detection.

  • Experimental results highlight substantial challenges for current state-of-the-art models, especially in handling multi-modal, multi-page documents, with key difficulties in perceptual accuracy and hallucination mitigation.

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

MMLongBench-Doc is a newly introduced benchmark aimed at evaluating the capabilities of Large Vision-Language Models (LVLMs) in the context of long-form, multi-modal document understanding. The benchmark addresses significant gaps in previous datasets, which predominantly focused on short, single-page documents, thereby limiting the scope of their evaluations.

Key Contributions

Dataset Construction:

  • The dataset comprises 130 PDF-formatted documents with an average of 49.4 pages and 20,970.9 textual tokens. These documents come from diverse sources such as research reports, financial reports, academic papers, brochures, and guidelines, ensuring well-rounded coverage for evaluation.
  • A distinctive feature is the inclusion of 1,062 expert-annotated questions that require evidence not only from textual content but also from images, charts, tables, and layout structures. Furthermore, 33.2% of these questions are cross-page questions, necessitating comprehension across multiple pages, and 22.8% are designed to be unanswerable, testing the models' hallucination detection capabilities (an illustrative annotation record is sketched below).
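To make the annotation structure concrete, the sketch below models one question record in Python. The field names (doc_id, evidence_pages, evidence_sources, answer_format) are illustrative assumptions for exposition, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DocQuestion:
    """One annotated question over a long PDF document (illustrative schema, not the official one)."""
    doc_id: str                        # which of the 130 source PDFs
    question: str
    answer: Optional[str]              # None for deliberately unanswerable questions
    answer_format: str                 # e.g. "string", "integer", "float", "list"
    evidence_pages: List[int] = field(default_factory=list)    # more than one page => cross-page question
    evidence_sources: List[str] = field(default_factory=list)  # e.g. ["text", "chart", "table", "layout"]

    @property
    def is_cross_page(self) -> bool:
        return len(self.evidence_pages) > 1

    @property
    def is_unanswerable(self) -> bool:
        return self.answer is None
```

Under this representation, the reported statistics (33.2% cross-page, 22.8% unanswerable) are simply the fractions of records for which is_cross_page and is_unanswerable hold.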

Evaluation Metrics:

  • The benchmark uses a combination of generalized accuracy and F1 score to provide a nuanced evaluation across different question types and evidence sources.
  • The benchmark methodology comprises a three-step evaluation protocol: response generation, answer extraction, and score calculation. This pipeline achieves a high correlation between automatic evaluation and human judgment, using format-specific rules for answers expressed as strings, integers, floats, and lists (a minimal sketch of the score-calculation step follows this list).
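Below is a minimal sketch of the score-calculation step. It assumes format-specific matching (exact match for strings and integers, a small relative tolerance for floats, set comparison for lists) and treats unanswerable questions as the negative class when computing F1; the benchmark's exact rules and its generalized-accuracy aggregation may differ.

```python
from typing import List, Optional

def answer_correct(pred: str, gold: str, fmt: str, float_tol: float = 0.01) -> bool:
    """Format-specific matching; a simplified stand-in for the benchmark's rules."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if fmt == "float":
        try:
            return abs(float(pred) - float(gold)) <= float_tol * max(abs(float(gold)), 1.0)
        except ValueError:
            return False
    if fmt == "list":
        return {p.strip() for p in pred.split(",")} == {g.strip() for g in gold.split(",")}
    return pred == gold  # strings and integers: exact match after normalization

def f1_score(preds: List[str], golds: List[Optional[str]], fmts: List[str]) -> float:
    """Aggregate F1, treating unanswerable questions (gold is None) as the negative class (an assumption)."""
    tp = fp = fn = 0
    for pred, gold, fmt in zip(preds, golds, fmts):
        declined = pred.strip().lower() in {"", "not answerable", "unanswerable"}
        if gold is None:
            if not declined:
                fp += 1              # answered an unanswerable question (hallucination)
        elif declined:
            fn += 1                  # missed an answerable question
        elif answer_correct(pred, gold, fmt):
            tp += 1
        else:
            fp += 1                  # a wrong answer counts against both precision
            fn += 1                  # and recall in this simplified aggregation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```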

Comparison with Previous Datasets:

  • In contrast to previous datasets like DocVQA, ChartQA, and SlideVQA, which primarily focused on single-page documents or those of limited complexity, MMLongBench-Doc stands out by its complexity and diversity. Document lengths, page numbers, and token densities in MMLongBench-Doc are significantly higher, thereby pushing the boundaries of existing LVLM capabilities.

Experimental Results

The paper conducts extensive experiments evaluating 14 LVLMs and 10 LLMs. The results compellingly reveal that long-context document understanding presents substantial challenges to current state-of-the-art models:

Performance Indicators:

  • The best-performing model, GPT-4o, achieved an F1 score of only 42.7%. This performance starkly contrasts with traditional document understanding tasks where models often exceed 90% accuracy.
  • Interestingly, many LVLMs perform worse than their LLM counterparts, which receive only lossy OCR-parsed text, underlining the difficulty of processing multi-modal, multi-page documents effectively (the two input conditions are sketched below).
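To make that comparison concrete, the sketch below renders PDF pages to images for the LVLM branch and keeps only the parsed text layer for the LLM branch. The PyMuPDF calls are real, but the pipeline is an illustration of the setup, not the paper's exact preprocessing.

```python
import os
import fitz  # PyMuPDF

def pdf_to_lvlm_and_llm_inputs(pdf_path: str, image_dir: str = "pages"):
    """Render each page to a PNG (LVLM input) and extract its text layer (LLM input)."""
    os.makedirs(image_dir, exist_ok=True)
    image_paths, page_texts = [], []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=144)               # rasterized page for the LVLM
            out_path = os.path.join(image_dir, f"page_{i:03d}.png")
            pix.save(out_path)
            image_paths.append(out_path)
            page_texts.append(page.get_text())           # text-only, lossy view for the LLM
    return image_paths, "\n\n".join(page_texts)
```

Anything conveyed only visually, such as charts, figures, and layout, disappears in the text branch; this is the lossy parsing that the LLM baselines above are fed.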

Error Analysis:

  • Most errors were attributed to hallucinated evidence, perceptual inaccuracies, and difficulty gathering complete evidence for cross-page questions. For instance, GPT-4o frequently attempted to provide answers to questions that were designed to be unanswerable, leading to a higher hallucination rate.

Implications and Future Directions

The pronounced challenges highlighted by MMLongBench-Doc underscore the necessity for more robust LVLM architectures capable of long-context comprehension. The benchmark reveals specific areas where existing models falter, such as:

  • Perceptual Capability: Enhancing the visual perception of models to accurately interpret images, charts, and complex layouts.
  • Cross-page Comprehension: Developing mechanisms for effective global searching and information aggregation across multiple document pages.
  • Hallucination Mitigation: Improving the models' ability to recognize when a question is unanswerable to reduce false positives.

Moving forward, the dataset could be instrumental in guiding the next generation of research in multi-modal long-context document understanding. Enhancing the pre-training corpus with more diverse and complex long-form documents, along with fine-tuning strategies, may bridge the performance gap identified in this study. Furthermore, practical applications of these advancements could span various domains including legal document analysis, scientific literature reviews, and large-scale financial report audits.

Conclusion

MMLongBench-Doc represents a significant advancement in evaluating the document understanding capabilities of LVLMs, particularly in the context of long, multi-modal documents. By identifying explicit challenges and providing a robust benchmark, this work paves the way for future developments that could significantly enhance the capabilities of LVLMs in practical, real-world applications.
