Abstract

Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, LLMs. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train LLMs. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale temporal QA dataset with 487K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from a cleaner, corrected version of the content, as well as answering questions from scanned images of newspaper pages. This and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets make it a unique and useful resource.

Figure: Map showing the distribution of ChroniclingAmericaQA newspaper pages across American states.

Overview

  • The paper introduces the ChroniclingAmericaQA dataset, leveraging the Chronicling America historical newspaper collection to create a QA dataset of 487K question-answer pairs from documents spanning 1800-1920.

  • The dataset presents challenges like working with noisy OCR text and promotes multimodal QA research by incorporating scanned images.

  • Evaluations on various models, including advanced LLMs like LLaMA2, highlight the dataset's utility and the significance of text quality and model adaptability in historical QA tasks.

  • The ChroniclingAmericaQA dataset not only advances QA research but also has applications in digital humanities, archival science, and education, with potential future expansions to encompass more historical document collections.

ChroniclingAmericaQA: Advancing Question Answering Research with Historical Newspaper Collections

Introduction to ChroniclingAmericaQA

The field of Question Answering (QA) has experienced notable advances owing to the advent of deep learning and, more specifically, the development of LLMs. However, a prevalent limitation within current QA research is its focus on modern textual data, overlooking the rich tapestry of historical documents available. The ChroniclingAmericaQA dataset seeks to address this gap by leveraging the Chronicling America historical newspaper collection, resulting in a novel QA dataset comprising 487K question-answer pairs derived from documents spanning 120 years (1800-1920). This dataset not only expands the temporal scope of QA research but also introduces the challenge of working with noisy OCR text, a common issue in digitized historical document collections.

Dataset Construction and Challenges

Creating the ChroniclingAmericaQA dataset involved a meticulous process to convert historical newspapers into a format suitable for QA research. Key challenges included handling noisy OCR text, which often hampers the extraction of accurate information. To mitigate this, the dataset provides both raw and corrected OCR text, allowing models to be tested under both realistic and idealized conditions. Moreover, by incorporating scanned images of the newspaper pages, the dataset promotes research into multimodal QA systems capable of interpreting both textual and visual data.
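
As an illustration of that multimodal direction, a document-VQA model can answer questions directly from a page scan. The sketch below uses Hugging Face's document-question-answering pipeline with a public checkpoint (impira/layoutlm-document-qa); the checkpoint, image path, and question are illustrative assumptions, not components of the paper's pipeline.

```python
# Minimal sketch: answering a question directly from a scanned page with a
# public document-VQA checkpoint (illustrative; not the paper's method).
# Requires: pip install transformers pytesseract Pillow
from transformers import pipeline

doc_qa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",
)

# Hypothetical path to a scanned newspaper page.
result = doc_qa(
    image="newspaper_page_1862.png",
    question="Who was elected mayor?",
)
print(result)  # e.g. [{'answer': '...', 'score': 0.87, 'start': 12, 'end': 14}]
```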

The dataset's construction process can be summarized into three critical steps:

  1. Data Collection: A diverse selection of newspaper pages was curated from the Chronicling America project, ensuring a wide geographic and temporal coverage.
  2. Data Preparation: The OCR text underwent a crucial correction phase employing GPT-3.5 Turbo, enhancing the text's quality for better question generation.
  3. Question Generation: A T5-base model generated the question-answer pairs, demonstrating that generative models can produce coherent and relevant QA pairs even from complex historical texts (a minimal sketch of steps 2 and 3 follows this list).
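
The sketch below shows how steps 2 and 3 might be wired together. It assumes the OpenAI chat API for OCR correction and a public community question-generation checkpoint (valhalla/t5-base-qg-hl); since the paper fine-tunes its own T5-base model, the checkpoint, prompt wording, and generation settings here are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch of step 2 (OCR correction) and step 3 (question
# generation). Prompt, checkpoint, and settings are assumptions, not the
# paper's exact setup.
from openai import OpenAI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def correct_ocr(raw_text: str) -> str:
    """Step 2: ask GPT-3.5 Turbo to repair OCR noise without altering content."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("Correct the OCR errors in the following historical "
                        "newspaper text without adding or removing "
                        "information:\n\n" + raw_text),
        }],
    )
    return response.choices[0].message.content


# Step 3: generate a question for a chosen answer span. This community
# checkpoint expects the answer wrapped in <hl> tokens.
tokenizer = AutoTokenizer.from_pretrained("valhalla/t5-base-qg-hl")
model = AutoModelForSeq2SeqLM.from_pretrained("valhalla/t5-base-qg-hl")


def generate_question(context: str, answer: str) -> str:
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    inputs = tokenizer("generate question: " + highlighted,
                       return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```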

Dataset Characteristics

ChroniclingAmericaQA distinguishes itself by its longitudinal coverage and the inclusion of noisy OCR text scenarios, offering a unique resource for QA research. It covers the longest time span of any available QA dataset, with over a century of content. This breadth not only introduces the challenge of language evolution over time but also tests a model's ability to discern information amidst the inherent inaccuracies of historical OCR text.

Evaluation and Insights

Evaluation of ChroniclingAmericaQA involved testing a range of models, including BERT, RoBERTa, and T5, alongside emerging LLMs such as LLaMA2 and Mistral. A few key insights (a minimal scoring sketch follows this list):

  • Performance Degradation with Noisy OCR: There's a noticeable performance drop when models are tested with raw OCR text compared to corrected text, underscoring the importance of text quality in historical QA tasks.
  • Model Adaptability: Models fine-tuned on both ChroniclingAmericaQA and other QA datasets such as SQuAD demonstrated superior performance, suggesting the benefit of a diverse training regimen that includes both modern and historical texts.
  • Value of LLMs: Advanced models like LLaMA2 showcased remarkable resilience against the challenges posed by the dataset, indicating the potential of LLMs in historical document QA research.
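
For scoring, extractive predictions are conventionally compared to gold answers with SQuAD-style exact-match and token-level F1. The snippet below is a minimal sketch using Hugging Face's evaluate package; the example ID and field names follow the SQuAD convention and are assumptions, as the dataset's release schema may differ.

```python
# Minimal sketch: SQuAD-style EM/F1 scoring of extractive predictions.
# The ID and field names follow the SQuAD convention (assumed, not the
# dataset's official schema).
import evaluate

squad_metric = evaluate.load("squad")

predictions = [
    {"id": "caqa-0001", "prediction_text": "the transcontinental railroad"},
]
references = [
    {"id": "caqa-0001",
     "answers": {"text": ["the transcontinental railroad"],
                 "answer_start": [104]}},
]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```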

Practical Implications and Future Directions

The introduction of ChroniclingAmericaQA paves the way for a new direction in QA research, emphasizing the untapped potential of historical document collections. Beyond academia, this dataset has practical applications in digital humanities, archival science, and education, facilitating access to and understanding of historical documents through advanced QA systems.

Future endeavors may extend the ChroniclingAmericaQA framework to other historical document collections, further enriching the resources available for QA research. Moreover, tackling the challenge of bias and ethical considerations in historical texts through advanced model training presents a crucial area for further investigation.

Conclusion

In summary, the ChroniclingAmericaQA dataset marks a significant step forward in the quest to extend QA and MRC tasks to historical documents. By bridging the gap between modern textual analyses and the rich informational content of historical archives, it lays the groundwork for a more inclusive and comprehensive approach to QA research.
