- The paper presents an end-to-end model that leverages pre-trained transformers to perform joint NER and RE without relying on external NLP tools.
- The model integrates a deep biaffine attention mechanism and a modified entity pretraining strategy to improve relation extraction performance.
- Evaluations on datasets such as ACE04 and ADE show significant performance improvements, validating the model’s effectiveness.
Joint NER and RE with Pre-trained LLMs
This paper introduces a novel end-to-end model for joint named entity recognition (NER) and relation extraction (RE) that leverages pre-trained LLMs to achieve state-of-the-art performance across multiple datasets and domains. The model addresses limitations of previous joint NER and RE models, such as reliance on external NLP tools, training from scratch, and inability to parallelize training. By incorporating a pre-trained, transformer-based LLM and eschewing recurrence for self-attention, the model is fast to train and does not require hand-crafted features or external tools.
Model Architecture and Implementation
The model architecture consists of an NER module and an RE module (Figure 1). The NER module uses a pre-trained BERT model to generate contextualized word embeddings, which are then fed into a feed-forward neural network (FFNN) for NER label classification using the BIOES tagging scheme.
Figure 1: Joint named entity recognition (NER) and relation extraction (RE) model architecture.
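Below is a minimal PyTorch sketch of the NER module described above, assuming the Hugging Face `transformers` package. The class name `NERModule` and the FFNN sizes are illustrative choices, not the paper's exact configuration.

```python
# Sketch of the NER module: BERT encodings -> FFNN -> BIOES label logits.
import torch.nn as nn
from transformers import BertModel

class NERModule(nn.Module):
    def __init__(self, num_ner_labels, bert_name="bert-base-uncased", hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Feed-forward classifier over the contextualized token embeddings.
        self.ffnn = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_ner_labels),  # one logit per BIOES tag
        )

    def forward(self, input_ids, attention_mask):
        # Contextualized embeddings from BERT: (batch, seq_len, hidden_size).
        hidden_states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        ner_logits = self.ffnn(hidden_states)  # (batch, seq_len, num_ner_labels)
        return hidden_states, ner_logits
```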
The RE module takes the predicted entity labels from the NER module and concatenates them with the BERT hidden states to form input representations. Relation candidates are constructed from all possible combinations of the last word tokens of predicted entities. A deep biaffine attention mechanism is then employed to classify the relations between these entity pairs. The biaffine classifier uses FFNNs to project the input representations into head and tail vector representations, enabling the model to capture the directionality of relations. The model is trained end-to-end by minimizing the sum of the cross-entropy losses for the NER and RE tasks.
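A sketch of the RE module's deep biaffine classifier follows, under the same assumptions as the NER sketch above. The label-embedding and projection sizes, and the single-example indexing in the forward pass, are illustrative simplifications rather than the paper's exact setup.

```python
# Sketch of the biaffine RE classifier: label embeddings are concatenated with BERT
# hidden states, projected into head/tail roles, and scored per relation class.
import torch
import torch.nn as nn

class BiaffineRE(nn.Module):
    def __init__(self, hidden_size, num_ner_labels, num_rel_labels,
                 label_embed_size=128, proj_size=256):
        super().__init__()
        self.label_embed = nn.Embedding(num_ner_labels, label_embed_size)
        in_size = hidden_size + label_embed_size
        # Separate projections for head and tail roles capture relation directionality.
        self.head_ffnn = nn.Sequential(nn.Linear(in_size, proj_size), nn.GELU())
        self.tail_ffnn = nn.Sequential(nn.Linear(in_size, proj_size), nn.GELU())
        # Biaffine scoring: a bilinear term plus a linear term over the concatenated pair.
        self.bilinear = nn.Parameter(torch.zeros(num_rel_labels, proj_size, proj_size))
        self.linear = nn.Linear(2 * proj_size, num_rel_labels)

    def forward(self, token_reprs, ner_label_ids, head_idx, tail_idx):
        # token_reprs: (batch, seq_len, hidden_size); ner_label_ids: (batch, seq_len)
        # head_idx / tail_idx: indices of the entities' last word tokens (single example).
        x = torch.cat([token_reprs, self.label_embed(ner_label_ids)], dim=-1)
        heads = self.head_ffnn(x[0, head_idx])  # (num_pairs, proj_size)
        tails = self.tail_ffnn(x[0, tail_idx])  # (num_pairs, proj_size)
        bilinear = torch.einsum("np,rpq,nq->nr", heads, self.bilinear, tails)
        return bilinear + self.linear(torch.cat([heads, tails], dim=-1))
```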
The model is implemented in PyTorch, utilizing the BERT-Base model from the PyTorch Transformers library. To accelerate training and reduce memory usage, NVIDIA's automatic mixed precision (AMP) library, Apex, is used.
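The paper itself uses NVIDIA's Apex; as an illustration of the same idea, the sketch below uses PyTorch's built-in `torch.cuda.amp` API, which has since largely superseded Apex for mixed-precision training. The `model`, `optimizer`, and `train_loader` objects are assumed to exist.

```python
# Sketch of mixed-precision training with torch.cuda.amp (stand-in for Apex).
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in train_loader:               # assumed training data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass in mixed precision
        ner_loss, re_loss = model(**batch)   # assumed joint model returning both losses
        loss = ner_loss + re_loss
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```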
Entity Pretraining Strategy
To address the issue of low-performance entity detection in the early stages of training, the paper introduces a modified entity pretraining strategy. Instead of delaying the training of the RE module, the contribution of the RE loss to the total loss is weighted during the first epoch of training. The weighting factor, λ, is increased linearly from 0 to 1 during the first epoch and set to 1 for the remaining epochs. This approach allows the NER module to quickly achieve good performance, which in turn benefits the RE module.
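A minimal sketch of this schedule is shown below, assuming access to a global step counter and the number of steps per epoch; the function name `re_loss_weight` is illustrative.

```python
# Modified entity pretraining: the RE loss is down-weighted by a factor lambda that
# ramps linearly from 0 to 1 over the first epoch, then stays at 1.
def re_loss_weight(global_step, steps_per_epoch):
    """Return lambda for the current training step."""
    if global_step >= steps_per_epoch:        # after the first epoch
        return 1.0
    return global_step / steps_per_epoch      # linear ramp during the first epoch

# Inside the training loop (names are illustrative):
# lam = re_loss_weight(step, steps_per_epoch)
# loss = ner_loss + lam * re_loss
```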
Experimental Evaluation and Results
The model is evaluated on five benchmark corpora across three domains: ACE04/05, CoNLL04, ADE, and i2b2. The results demonstrate that the model matches or exceeds state-of-the-art performance for joint NER and RE on most datasets. Specifically, the model achieves substantial improvements on the ACE04 (4.59%) and ADE (10.25%) corpora. On the i2b2 dataset, the model's performance is compared to independent NER and RE systems, as there are no published joint models for this dataset. The ablation studies reveal that the deep biaffine attention mechanism and the entity pretraining strategy are critical for achieving maximum performance.
Figure 2: Visualization of the attention weights from select layers and heads of BERT after it was fine-tuned within our model on the CoNLL04 corpus. Darker squares indicate larger attention weights. Attention weights are shown for the input sentence: "Ruby fatally shot Oswald two days after Kennedy was assassinated." The CLS and SEP tokens have been removed. Four major patterns are displayed: attention to the next word (first panel from the left), the previous word (second), the word itself (third), and the end of the sentence (fourth).
Analysis of Attention Weights
The inclusion of a transformer-based LLM allows the attention weights to be visualized and analyzed. Visualizing these weights after fine-tuning (Figure 2) shows that the model retains patterns observed in pre-trained BERT models, such as attending to the next and previous words, the word itself, and the end of the sentence.
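A sketch of how such attention weights can be extracted for inspection, assuming the Hugging Face `transformers` package; the layer and head indices are arbitrary and the plotting step is omitted.

```python
# Extract per-layer, per-head attention weights from a BERT model for visualization.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "Ruby fatally shot Oswald two days after Kennedy was assassinated."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer, each (batch, heads, seq, seq).
layer, head = 0, 0                             # arbitrary layer/head to inspect
weights = outputs.attentions[layer][0, head]   # (seq_len, seq_len) attention matrix
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# weights[i, j] is how much token i attends to token j; drop the [CLS]/[SEP] rows and
# columns and plot as a heatmap to reproduce patterns like next-/previous-word attention.
```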
Conclusion and Future Directions
The proposed end-to-end model offers several advantages over previous joint NER and RE models, including no reliance on hand-crafted features or external NLP tools, integration of a pre-trained LLM, and state-of-the-art performance across multiple datasets and domains. The model's modularity allows for easy adaptation to different domains and LLMs.
The paper identifies several avenues for future research, including extending the model to handle inter-sentence relations, nested entities, and multilingual corpora. The authors note that high performance on the ADE corpus may not transfer to real-world scenarios due to the corpus's simplicity.