DocVQA: A Dataset for VQA on Document Images

Published 1 Jul 2020 in cs.CV and cs.IR | (2007.00398v3)

Abstract: We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org

Summary

The paper introduces a new dataset of 50,000 questions over more than 12,000 document images to advance document-based VQA research.
It demonstrates that traditional VQA models struggle with document layouts, emphasizing the need for integrated text and spatial reasoning.
BERT-based models show promise on DocVQA, highlighting the potential of NLP techniques in improving document comprehension accuracy.

An Expert Analysis of "DocVQA: A Dataset for VQA on Document Images"

"DocVQA: A Dataset for VQA on Document Images" contributes to the field of Visual Question Answering (VQA) by introducing a new dataset tailored to document images. The paper addresses a specialized but important part of VQA, where the primary task is to read and interpret document images. Unlike traditional VQA datasets built on scene images, DocVQA requires models to combine the textual and visual elements found in structured documents.

The dataset contains 50,000 questions on more than 12,000 document images, a scale large enough to support training and benchmarking in this specific VQA domain. Readers familiar with machine reading comprehension and traditional VQA will recognize the distinct challenges introduced by document images, which include handling dense semantic information and exploiting layout, tables, graphs, and other document-specific features. Notably, the dataset covers a range of document types, including forms, reports, handwritten notes, and figures, sourced from historical collections that span multiple decades.

An initial baseline evaluation of existing VQA and reading comprehension models demonstrates the difficulty of the tasks posed by DocVQA. Although state-of-the-art models such as LoRRA and M4C are designed to read text within scene images, they perform poorly on documents, with accuracy well below human performance (94.36%). Their effectiveness is limited by a reliance on object detection components, which do not transfer directly to document images, where textual layout and spatial arrangement matter more than objects.

The authors' implementation of heuristics and various upper-bound estimates provides a comprehensive analysis of the baseline models' capabilities, revealing that substantial improvements are attainable. Specifically, their studies underscore that successful integration of textual recognition and spatial reasoning components could narrow the performance gap with human respondents.
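To make the idea of an upper-bound estimate concrete, the following is a minimal sketch of one plausible bound: the fraction of questions for which at least one accepted answer appears verbatim in the OCR text of the page. An extractive model that can only copy spans from the OCR output cannot answer the remaining questions. The function names, the normalization (lowercasing and whitespace collapsing), and the toy examples are assumptions made for illustration and are not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def ocr_substring_upper_bound(
    examples: Iterable[Tuple[Sequence[str], Sequence[str]]],
) -> float:
    """Return the fraction of questions answerable by copying from OCR text.

    Each example is a pair (ocr_tokens, accepted_answers). A question counts
    as answerable if any accepted answer is a substring of the joined,
    normalized OCR tokens.
    """
    total = 0
    answerable = 0
    for ocr_tokens, answers in examples:
        total += 1
        page_text = normalize(" ".join(ocr_tokens))
        if any(normalize(a) in page_text for a in answers if a.strip()):
            answerable += 1
    return answerable / total if total else 0.0


# Hypothetical toy data: (OCR tokens of the page, accepted answers).
examples = [
    (["Invoice", "Date:", "12", "March", "1998"], ["12 March 1998"]),
    (["Total", "Amount", "$450"], ["$450", "450"]),
    (["Signed", "by", "J.", "Smith"], ["John Smith"]),  # not in OCR text
]
bound = ocr_substring_upper_bound(examples)
print(f"OCR substring upper bound: {bound:.2%}")
```

In this toy data the third answer never appears in the OCR text, so no span-copying model could get it right. Any gap between such a bound and 100% reflects OCR errors or answers that need more than span extraction.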

The use of BERT-based models for this task highlights the potential of natural language processing techniques in VQA frameworks. BERT models fine-tuned on the DocVQA dataset outperform traditional VQA models, reaching an accuracy of 55.77%. Even so, this remains far below the human accuracy of 94.36%, and there is considerable room for improvement, particularly in developing models that better handle visual context and handwritten text.
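For readers who want to see what an accuracy figure of this kind measures, here is a minimal sketch that scores predictions by exact match against a set of accepted answers per question. The matching rule (case-insensitive, whitespace-collapsed) and all names and data are assumptions for illustration; the paper's own evaluation script may normalize answers differently.

```python
from typing import Sequence


def accuracy_percent(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
) -> float:
    """Percentage of predictions that exactly match any accepted answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0

    def norm(s: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(s.lower().split())

    correct = sum(
        1
        for pred, answers in zip(predictions, references)
        if norm(pred) in {norm(a) for a in answers}
    )
    return 100.0 * correct / len(predictions)


# Hypothetical predictions and accepted answers.
preds = ["12 March 1998", "$ 450", "john smith", "1,200"]
refs = [["12 march 1998"], ["$450"], ["John Smith", "J. Smith"], ["1200", "1,200"]]
score = accuracy_percent(preds, refs)
print(f"accuracy: {score:.2f}%")
```

The second prediction fails only because of an inserted space, which shows how strict exact matching penalizes small OCR or formatting errors.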

Looking ahead, this research could substantially impact practical applications such as automated document processing, intelligent information retrieval, and document comprehension systems. Future work on purpose-driven document analysis may yield models that combine fine-grained text interpretation with an understanding of overall document structure.

The "DocVQA" paper sets a precedent by bridging document analysis with visual cognition, advocating for an integrated approach that marries low-level feature extraction with a high-level purpose-oriented understanding. This is pivotal in achieving holistic document processing solutions in increasingly digitized environments.
