
Contrastive Entity Coreference and Disambiguation for Historical Texts

(2406.15576)
Published Jun 21, 2024 in cs.CL, econ.GN, and q-fin.EC

Abstract

Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typically lack unique cross-document identifiers for individuals mentioned within the texts, as well as individual identifiers from external knowledgebases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical documents, which are replete with individuals not remembered in contemporary knowledgebases. This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts: a massive-scale training dataset replete with hard negatives - that sources over 190 million entity pairs from Wikipedia contexts and disambiguation pages - high-quality evaluation data from hand-labeled historical newswire articles, and trained models evaluated on this historical benchmark. We contrastively train bi-encoder models for coreferencing and disambiguating individuals in historical texts, achieving accurate, scalable performance that identifies out-of-knowledgebase individuals. Our approach significantly surpasses other entity disambiguation models on our historical newswire benchmark. Our models also demonstrate competitive performance on modern entity disambiguation benchmarks, particularly certain news disambiguation datasets.

Figure: Mentions over time of top entities in newswire articles.

Overview

  • The paper introduces innovative methods to tackle the challenges of entity coreference and disambiguation in massive historical document collections, which are crucial for social science research but often lack consistent identifiers for individuals.

  • Key contributions include the creation of an extensive training dataset (WikiConfusables), development of high-quality historical evaluation data, and training of state-of-the-art models using a bi-encoder retrieval architecture to achieve superior performance.

  • The research demonstrates substantial improvements in handling both in-knowledgebase and out-of-knowledgebase historical entities and highlights practical applications, theoretical advances, and ethical considerations in processing historical texts.


The paper "Contrastive Entity Coreference and Disambiguation for Historical Texts" addresses the challenges associated with disambiguating and coreferencing individuals in massive collections of historical documents. These documents are pivotal for social science research, yet they often lack cross-document and external identifier tags for individuals mentioned within. This study puts forth methods and contributions to bridge this gap, particularly given that existing methods fall short in accuracy when handling historical documents, which frequently include individuals not covered in modern knowledgebases like Wikipedia.

Contributions

The study's contributions span three primary areas:

  1. Creation of an Extensive Training Dataset (WikiConfusables):

    • Constructed over 190 million entity pairs from Wikipedia hyperlink contexts and disambiguation pages.
    • Extracted highly confusable hard negative pairs using Wikipedia disambiguation pages and familial relationships mined from Wikidata (see the mining sketch after this list).
    • Built coreference and disambiguation training datasets that facilitate the development of robust models capable of handling historically referenced individuals.
  2. Development of High-Quality Historical Evaluation Data:

    • A manually labeled dataset of mid-20th-century historical newswire articles was created to benchmark entity disambiguation and coreference models.
    • The "Entities of the Union" benchmark provides rigorous evaluation data, including in-knowledgebase and out-of-knowledgebase entities.
  3. Training and Evaluation of Innovative Models:

    • Employs a bi-encoder retrieval architecture for model training, optimizing for scalability and accuracy. This architecture enables efficient disambiguation by encoding mentions of the same entity close together in the embedding space.
    • Achieved state-of-the-art performance on selected benchmarks, significantly exceeding the performance of existing methods on historical documents.
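
As a rough illustration of how such confusable training pairs can be mined, the sketch below pairs hyperlink contexts of different people listed on the same Wikipedia disambiguation page. The data structures and names are simplified assumptions for exposition, not the paper's actual pipeline, which operates over full Wikipedia and Wikidata dumps.

```python
# Sketch: mining hard negative pairs from Wikipedia disambiguation pages.
# All inputs are assumed to be pre-parsed dictionaries; the real pipeline
# additionally mines confusable relatives via Wikidata familial relations.
from itertools import combinations
from typing import Dict, List, Tuple

def hard_negative_pairs(
    disambiguation_pages: Dict[str, List[str]],  # page title -> entity titles it lists
    entity_contexts: Dict[str, List[str]],       # entity title -> hyperlink mention contexts
) -> List[Tuple[str, str]]:
    """Return (context_a, context_b) pairs whose mentions share a surface form
    but refer to different entities - hard negatives for contrastive training."""
    pairs = []
    for _, entities in disambiguation_pages.items():
        for ent_a, ent_b in combinations(entities, 2):
            for ctx_a in entity_contexts.get(ent_a, []):
                for ctx_b in entity_contexts.get(ent_b, []):
                    pairs.append((ctx_a, ctx_b))
    return pairs

# Positive pairs, by contrast, would couple two different hyperlink contexts
# that point to the same entity page.
```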

Methodology

The core algorithm uses a contrastively trained bi-encoder retrieval model, which is notable for its computational efficiency and accuracy:

  • Coreference Resolution: The LinkMentions model uses paired contexts from Wikipedia hyperlinks to train entity coreference. Through contrastive training, it brings mentions of the same entity close together in the embedding space while pushing apart mentions of different entities (a schematic training sketch follows this list).
  • Disambiguation: LinkWikipedia fine-tunes this coreference model to disambiguate entities by pairing their contexts with Wikipedia entry templates. Further adaptation to historical news datasets is achieved through the LinkNewsWikipedia model.
  • Training Setup: Utilizes Nvidia A6000 GPUs for training, achieving scalable performance with modest computational resources.
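
The sketch below illustrates the kind of contrastive objective such a bi-encoder can be trained with: a symmetric in-batch InfoNCE loss over paired mention contexts. The encoder checkpoint, mean pooling, and temperature are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: one contrastive training objective for a bi-encoder coreference model.
# Mentions of the same entity are pulled together; other mentions in the batch
# (including mined hard negatives) are pushed apart. Hyperparameters are assumed.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder, not the paper's checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(texts):
    """Mean-pool token embeddings into one L2-normalized vector per mention context."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state    # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

def contrastive_loss(anchor_contexts, positive_contexts, temperature=0.05):
    """Symmetric in-batch InfoNCE: row i of the two lists mentions the same
    entity; every other pairing in the batch is treated as a negative."""
    a = embed(anchor_contexts)
    p = embed(positive_contexts)
    logits = a @ p.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```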

Results

The proposed models were evaluated on a new historical benchmark, "Entities of the Union", which covers mid-20th century newswire articles, demonstrating significant improvements over traditional entity disambiguation models. Key results include:

  • Performance: The LinkNewsWikipedia model achieved 78.3% accuracy on the historical news benchmark, compared to 65.4% for the next best model (ReFinED). Its handling of out-of-knowledgebase individuals was pivotal to this result (a schematic inference sketch follows this list).
  • Modern Benchmarks: The models demonstrated competitive performance on modern benchmarks, particularly excelling in disambiguating news entities, evidenced by near-perfect performance on datasets like MSNBC.
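
At inference time, a bi-encoder of this kind can disambiguate a mention by retrieving the nearest knowledgebase entry and abstaining when the best similarity is too low, which is one way out-of-knowledgebase individuals can be flagged. The sketch below assumes precomputed, L2-normalized embeddings and an illustrative threshold; neither is a value reported in the paper.

```python
# Sketch: nearest-neighbour disambiguation with an out-of-knowledgebase option.
# Embeddings are assumed L2-normalized, so dot products are cosine similarities.
import torch

def disambiguate(query_embedding: torch.Tensor,  # (H,) mention-context embedding
                 kb_embeddings: torch.Tensor,    # (N, H) knowledgebase entry embeddings
                 kb_titles: list,
                 threshold: float = 0.6):        # illustrative cut-off, not from the paper
    """Return the best-matching knowledgebase title, or None when the mention is
    judged to refer to an out-of-knowledgebase individual."""
    scores = kb_embeddings @ query_embedding     # similarity to each knowledgebase entry
    best = int(torch.argmax(scores))
    if scores[best].item() < threshold:
        return None                              # predict out-of-knowledgebase
    return kb_titles[best]
```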

Implications and Future Developments

The research has substantial implications for the field:

  • Practical Applications: The models can be applied to vast historical document collections, aiding in the study of socio-historical patterns by systematically tagging and analyzing individuals.
  • Knowledgebase Expansion: By identifying notable but less-remembered historical figures, this research can guide the expansion of current knowledgebases like Wikipedia to cover a broader historical spectrum.
  • Theoretical Advances: The work highlights the importance of domain-specific training data, particularly for historical contexts, proposing efficient training methodologies like contrastive learning with hard negatives.

Limitations and Ethical Considerations

The study acknowledges several limitations:

  • Ensuring comprehensive coverage of all historical individuals remains challenging since the training relies on extant knowledgebases.
  • Coreferencing and disambiguation accuracy, while high, may still necessitate human verification for critical applications.

Ethical considerations underline the necessity of critical interpretation of algorithmically tagged data, given the biases present in historical documents.

Conclusion

This paper makes considerable advancements in the field of entity coreference and disambiguation for historical texts. The methodologies and datasets introduced not only improve accuracy but also ensure the scalability of processing massive historical document collections, thereby presenting a significant step forward in the digitization and analysis of historical data.
