- The paper presents a novel dataset for document-level information extraction that leverages semi-automatic annotation with expert corrections to ensure high-quality data.
- It details a neural model combining SciBERT and BiLSTM to extract entities and relationships across entire scientific documents, addressing challenges in coreference and saliency detection.
- Experimental results show improved recall in document-level entity clustering while highlighting the need for enhanced methods to capture global context.
Introduction
"SciREX: A Challenge Dataset for Document-Level Information Extraction" introduces a dataset geared towards advancing document-level information extraction (IE) in the field of NLP. While traditional IE datasets focus on sentence or paragraph-level data, this work addresses the complex task of extracting coherent information from entire scientific documents, where relationships often extend beyond individual sentences or sections.
The SciREX dataset integrates multiple IE tasks, including salient entity identification and document-level N-ary relation extraction, over full-text machine learning articles. Its annotations are enriched through a blend of automatic and manual techniques that leverage established scientific resources.
Dataset Construction
One of the challenges in creating a document-level IE dataset is the extensive domain knowledge required to annotate content spanning full documents. SciREX circumvents this problem by adopting a semi-automatic annotation approach that blends automatic labeling with expert manual corrections. Initial automated labeling is carried out using a sequence labeling model trained on the SciERC dataset. Human annotators then correct these labels, ensuring high-quality data.
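This annotate-then-correct loop can be sketched as a simple merge of automatic BIO tags with expert overrides. This is a minimal illustration of the workflow; the function name and data layout are assumptions, not the paper's actual annotation interface:

```python
def apply_corrections(auto_labels, corrections):
    """Merge expert corrections into automatically predicted BIO tags.

    auto_labels: list of (token, tag) pairs produced by the automatic
        tagger (in SciREX, a sequence labeler trained on SciERC);
    corrections: {token_index: corrected_tag} from the human pass.
    Tokens without a correction keep their automatic tag.
    """
    return [(tok, corrections.get(i, tag))
            for i, (tok, tag) in enumerate(auto_labels)]

# The tagger mislabels "SQuAD" as a Method; the annotator fixes it.
auto = [("We", "O"), ("use", "O"), ("SQuAD", "B-Method")]
fixed = apply_corrections(auto, {2: "B-Dataset"})
# fixed == [("We", "O"), ("use", "O"), ("SQuAD", "B-Dataset")]
```

Keeping the automatic tags as defaults means annotators only touch the (relatively few) erroneous spans, which is what makes full-document annotation tractable.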
Papers with Code (PwC) serves as a distant supervision signal. PwC annotates result tuples (Dataset, Metric, Method, Task, Score) for papers, and SciREX leverages these tuples to enrich its annotations, even though PwC does not record where the tuple elements are mentioned within each document.
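Because PwC supplies entity names but not positions, aligning a tuple to a document amounts to locating candidate mention spans for each field. A simplified stand-in for that alignment, using case-insensitive string matching (the function and its interface are illustrative assumptions, not SciREX's exact procedure):

```python
import re

def find_mention_spans(doc_text, tuple_fields):
    """Locate candidate mention spans for a PwC result tuple.

    tuple_fields maps a role (e.g. "Dataset", "Method") to the entity
    name PwC records for it; each field is aligned to the document via
    case-insensitive exact-string matching.
    Returns {role: [(start, end), ...]} character offsets.
    """
    spans = {}
    for role, name in tuple_fields.items():
        spans[role] = [(m.start(), m.end())
                       for m in re.finditer(re.escape(name), doc_text,
                                            re.IGNORECASE)]
    return spans

doc = "We evaluate BERT on SQuAD using exact match. BERT improves exact match."
spans = find_mention_spans(doc, {"Method": "BERT", "Dataset": "SQuAD",
                                 "Metric": "exact match"})
# "BERT" and "exact match" each occur twice; "SQuAD" once.
```

In practice this kind of matching is noisy (abbreviations, paraphrases), which is precisely why SciREX follows the automatic pass with expert correction.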
Model Architecture
The baseline model employs a neural architecture that jointly addresses the tasks required for end-to-end IE on full documents. It uses a two-tier design: section-wise token embeddings are first obtained with SciBERT, and a BiLSTM then runs over the concatenated sections to incorporate document-level context.
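The two-tier structure can be sketched in PyTorch. This is a minimal structural sketch only: a plain embedding layer stands in for SciBERT, dimensions are toy-sized, and the class is an assumed illustration rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class TwoTierEncoder(nn.Module):
    """Two-tier encoding: per-section token embeddings (stand-in for
    SciBERT), then a document-level BiLSTM over all sections."""

    def __init__(self, vocab_size=100, emb_dim=32, hidden=16):
        super().__init__()
        # Tier 1 stand-in: in SciREX this is SciBERT applied per section.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Tier 2: BiLSTM contextualizes tokens across the whole document.
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, sections):
        # sections: list of 1-D LongTensors, one per document section.
        encoded = [self.embed(s) for s in sections]       # per-section
        doc = torch.cat(encoded, dim=0).unsqueeze(0)      # (1, T, emb_dim)
        out, _ = self.bilstm(doc)                         # document-wide
        return out.squeeze(0)                             # (T, 2 * hidden)

enc = TwoTierEncoder()
secs = [torch.randint(0, 100, (5,)), torch.randint(0, 100, (7,))]
reps = enc(secs)  # shape: (12, 32) - one contextual vector per token
```

The key point the sketch preserves is that the first tier sees only one section at a time (transformers cannot fit whole papers in one window), while the second tier propagates information across section boundaries.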
Experimental Results
Performance is gauged on two fronts: component-wise and end-to-end. Component-wise evaluation feeds each task ground-truth inputs to measure its standalone accuracy. The critical bottleneck identified is identifying salient clusters, which is essential for accurate relation extraction.
The model demonstrates greater recall on document-level entity clustering tasks compared to sentence-level, showing its capacity for cross-contextual analysis. However, challenges are pronounced in tasks like identifying salient mentions, highlighting the need for models that better understand document-level context and relevance.
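Cluster-level recall of the kind reported here can be illustrated with a simple overlap criterion: a gold mention cluster counts as recovered if some predicted cluster covers enough of its mentions. The threshold and matching rule below are assumptions for illustration, not SciREX's exact metric:

```python
def cluster_recall(gold_clusters, pred_clusters, threshold=0.5):
    """Fraction of gold mention clusters matched by a predicted cluster.

    A gold cluster is matched if some predicted cluster contains at
    least `threshold` of its mentions (mentions are hashable ids).
    """
    matched = 0
    for gold in gold_clusters:
        g = set(gold)
        if any(len(g & set(pred)) / len(g) >= threshold
               for pred in pred_clusters):
            matched += 1
    return matched / len(gold_clusters)

gold = [["m1", "m2", "m3"], ["m4", "m5"]]
pred = [["m1", "m2"], ["m9"]]
# First gold cluster is 2/3 covered (matched); second is missed.
# cluster_recall(gold, pred) == 0.5
```

Under a metric like this, merging mentions across distant sections directly raises recall, which is where the document-level model's cross-contextual capacity pays off.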
Implications and Future Directions
SciREX establishes a benchmark for developing document-level IE models that must manage extended contexts and intricate entity relationships. Future work will need to address:
- Handling of large token sequences in transformers.
- Enhanced methods for document-centric saliency detection.
- Exploration of N-ary relation extraction methods to efficiently aggregate wider document contexts.
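One simple way to aggregate wider document context for an N-ary candidate, sketched here purely as an assumed baseline (not a method from the paper), is to score a tuple by how often its entities co-occur in the same sections:

```python
from itertools import combinations

def score_nary_tuple(entity_sections, tuple_entities):
    """Score a candidate N-ary tuple (e.g. Dataset, Metric, Method, Task)
    by averaging pairwise section co-occurrence (Jaccard overlap of the
    section-index sets where each entity is mentioned)."""
    pair_scores = []
    for a, b in combinations(tuple_entities, 2):
        shared = entity_sections[a] & entity_sections[b]
        union = entity_sections[a] | entity_sections[b]
        pair_scores.append(len(shared) / len(union) if union else 0.0)
    return sum(pair_scores) / len(pair_scores)

# Section indices where each entity is mentioned (toy data).
sections = {"BERT": {0, 1}, "SQuAD": {1}, "exact match": {1}}
score = score_nary_tuple(sections, ("BERT", "SQuAD", "exact match"))
```

Such pairwise decomposition is exactly the kind of local approximation the paper's future-work discussion pushes beyond: true document-level N-ary extraction should also link entities that never share a section.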
Conclusion
The introduction of SciREX represents a significant step towards comprehensive document-level information extraction. While the baseline model provides a strong foundation, several challenges remain in achieving robust end-to-end document-level IE. The dataset promises to push boundaries in IE research, especially in terms of handling complex relationships and salient information spanning entire documents.