PubTables-1M: Towards comprehensive table extraction from unstructured documents (2110.00061v3)

Published 30 Sep 2021 in cs.LG and cs.CV

Abstract: Recently, significant progress has been made applying machine learning to the problem of table structure inference and extraction from unstructured documents. However, one of the greatest challenges remains the creation of datasets with complete, unambiguous ground truth at scale. To address this, we develop a new, more comprehensive dataset for table extraction, called PubTables-1M. PubTables-1M contains nearly one million tables from scientific articles, supports multiple input modalities, and contains detailed header and location information for table structures, making it useful for a wide variety of modeling approaches. It also addresses a significant source of ground truth inconsistency observed in prior datasets called oversegmentation, using a novel canonicalization procedure. We demonstrate that these improvements lead to a significant increase in training performance and a more reliable estimate of model performance at evaluation for table structure recognition. Further, we show that transformer-based object detection models trained on PubTables-1M produce excellent results for all three tasks of detection, structure recognition, and functional analysis without the need for any special customization for these tasks. Data and code will be released at https://github.com/microsoft/table-transformer.

Citations (73)

Summary

  • The paper presents PubTables-1M, a novel dataset of nearly one million tables drawn from scientific articles for table detection and structure recognition.
  • It introduces rich annotations and a canonicalization method to overcome oversegmentation, improving both functional analysis and extraction precision.
  • Experiments demonstrate that transformer models like DETR achieve significant improvements in precision and recall when trained with PubTables-1M.

Overview of "PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents"

The paper "PubTables-1M: Towards Comprehensive Table Extraction from Unstructured Documents" presents a novel dataset, PubTables-1M, designed to enhance table structure inference and extraction from unstructured documents. This dataset significantly advances the creation of datasets featuring extensive, accurate ground truth necessary for effective machine learning model training, especially in the context of table extraction tasks.

Key Contributions

The contributions of PubTables-1M are multifaceted:

  1. Dataset Size and Scope: PubTables-1M comprises nearly one million tables sourced from scientific articles, making it one of the largest datasets dedicated to table extraction. It addresses three critical sub-tasks: table detection (TD), table structure recognition (TSR), and functional analysis (FA).
  2. Rich Annotations: The dataset includes comprehensive annotations such as location data for rows, columns, and cells, as well as detailed header information. This variety in annotations supports diverse modeling strategies across multiple input modalities.
  3. Canonicalization: A novel procedure corrects oversegmentation, a common source of ground-truth inconsistency in prior datasets in which a cell that should span multiple columns or rows is instead annotated as several smaller cells, introducing ambiguity into both training and evaluation; a minimal sketch of the underlying idea appears after this list.
  4. Quality Control: PubTables-1M contains mechanisms for automated quality verification, offering measurable assurances regarding ground truth accuracy.
  5. Application of Transformers: The paper demonstrates that a transformer-based object detection model, the Detection Transformer (DETR), performs strongly on detection, structure recognition, and functional analysis without any task-specific customization when trained on PubTables-1M; a hedged inference sketch for the detection task also appears below.
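
The paper's canonicalization procedure operates on the source annotations of the articles and is more involved than can be shown here; the following is only a minimal, hypothetical sketch of the underlying idea, namely absorbing the blank fragments of an oversegmented header cell back into the spanning cell they belong to. The Cell structure and merge_oversegmented_header_cells function are invented for illustration and do not come from the released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Cell:
    row: int          # grid row index
    col: int          # starting grid column index
    colspan: int      # number of grid columns the cell spans
    text: str
    is_header: bool

def merge_oversegmented_header_cells(cells: List[Cell]) -> List[Cell]:
    """Absorb blank header fragments into the non-blank header cell to their left.

    Oversegmentation splits what should be one spanning header cell into several
    single-column cells, only the first of which carries text; this toy pass
    merges the empty neighbours back into that first cell by growing its colspan.
    """
    merged: List[Cell] = []
    for cell in sorted(cells, key=lambda c: (c.row, c.col)):
        prev = merged[-1] if merged else None
        if (
            cell.is_header
            and not cell.text.strip()                    # blank fragment of a split cell
            and prev is not None
            and prev.is_header
            and prev.row == cell.row
            and prev.col + prev.colspan == cell.col      # immediately adjacent
        ):
            prev.colspan += cell.colspan                 # grow the spanning cell
        else:
            merged.append(cell)
    return merged
```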
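
To make the DETR claim concrete, here is a hedged inference sketch for the detection task. It assumes the checkpoints later published alongside the table-transformer repository are available on the Hugging Face Hub under microsoft/table-transformer-detection; that checkpoint name and the input file page.png are assumptions, not details stated in the paper.

```python
# Hedged sketch: running a DETR-style detector over a rendered page image.
# The checkpoint name below is assumed from the public table-transformer release;
# any DETR model fine-tuned on PubTables-1M could be substituted.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

image = Image.open("page.png").convert("RGB")   # a rendered document page (assumed input)

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)

# Convert raw logits and normalized boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```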

Evaluation of Results

The evaluation results underscore the dataset's value, particularly for improving model performance on the TSR and FA tasks. The paper reports substantial gains in average precision (AP) and average recall (AR) for DETR compared with a Faster R-CNN baseline, and the canonicalized ground truth yields more reliable training and evaluation, as reflected in consistent improvements across the TSR metrics.
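
The AP and AR numbers reported in the paper follow standard object-detection evaluation over predicted and ground-truth boxes. As a rough illustration of what those quantities measure, here is a toy single-threshold precision/recall computation with greedy IoU matching; the function names and the 0.5 threshold are illustrative and not the paper's exact evaluation protocol.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def precision_recall(pred: List[Box], gt: List[Box], thresh: float = 0.5) -> Tuple[float, float]:
    """Greedy one-to-one matching of predictions to ground truth at a single IoU threshold."""
    matched = set()
    tp = 0
    for p in pred:
        best_j, best_iou = None, thresh
        for j, g in enumerate(gt):
            score = iou(p, g)
            if j not in matched and score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall
```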

Implications and Future Directions

The introduction of PubTables-1M has practical implications for automated table extraction, which is essential for data-driven industries that rely on unstructured document analysis. The dataset's comprehensive annotations and canonicalization procedure represent a systematic improvement over existing datasets, enabling models that more precisely recover table structures and their logical representations.

Looking to the future, the research opens avenues for expanding table extraction methodologies into diverse domains beyond scientific articles, such as financial documentation. Additionally, addressing challenges like accurately annotating row headers in complex tables remains a potential area for exploration. Integrating table extraction with comprehensive document understanding systems is an anticipated evolution, promising enhancements in the field of information retrieval and processing in AI.

In conclusion, PubTables-1M marks a substantial step forward in dataset quality for table extraction tasks, underpinning advancements in model training and evaluation. As the field evolves, datasets like PubTables-1M will be instrumental in driving innovation in automated document analysis.