- The paper presents an end-to-end model that leverages pre-trained transformers to perform joint NER and RE without relying on external NLP tools.
- The model integrates a deep biaffine attention mechanism and a modified entity pretraining strategy to improve relation extraction performance.
- Evaluations on datasets such as ACE04 and ADE show significant performance improvements, validating the model’s effectiveness.
Joint NER and RE with Pre-trained LLMs
This paper introduces a novel end-to-end model for joint named entity recognition (NER) and relation extraction (RE) that leverages pre-trained LLMs to achieve state-of-the-art performance across multiple datasets and domains. The model addresses limitations of previous joint NER and RE models, such as reliance on external NLP tools, training from scratch, and inability to parallelize training. By incorporating a pre-trained, transformer-based LLM and eschewing recurrence for self-attention, the model is fast to train and does not require hand-crafted features or external tools.
Model Architecture and Implementation
The model architecture consists of an NER module and an RE module (Figure 1). The NER module uses a pre-trained BERT model to generate contextualized word embeddings, which are then fed into a feed-forward neural network (FFNN) for NER label classification using the BIOES tagging scheme.
Figure 1: Joint named entity recognition (NER) and relation extraction (RE) model architecture.
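Below is a minimal PyTorch sketch of the NER module described above, assuming the Hugging Face `transformers` package. The class name `NERModule` and the FFNN sizes are illustrative choices, not the paper's exact configuration.

```python
# Sketch of the NER module: BERT encodings -> FFNN -> BIOES label logits.
import torch.nn as nn
from transformers import BertModel

class NERModule(nn.Module):
    def __init__(self, num_ner_labels, bert_name="bert-base-uncased", hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Feed-forward classifier over the contextualized token embeddings.
        self.ffnn = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_ner_labels),  # one logit per BIOES tag
        )

    def forward(self, input_ids, attention_mask):
        # Contextualized embeddings from BERT: (batch, seq_len, hidden_size).
        hidden_states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        ner_logits = self.ffnn(hidden_states)  # (batch, seq_len, num_ner_labels)
        return hidden_states, ner_logits
```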
The RE module takes the predicted entity labels from the NER module and concatenates them with the BERT hidden states to form input representations. Relation candidates are constructed from all possible combinations of the last word tokens of predicted entities. A deep biaffine attention mechanism is then employed to classify the relations between these entity pairs. The biaffine classifier uses FFNNs to project the input representations into head and tail vector representations, enabling the model to capture the directionality of relations. The model is trained end-to-end by minimizing the sum of the cross-entropy losses for the NER and RE tasks.
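A sketch of the RE module's deep biaffine classifier follows, under the same assumptions as the NER sketch above. The label-embedding and projection sizes, and the single-example indexing in the forward pass, are illustrative simplifications rather than the paper's exact setup.

```python
# Sketch of the biaffine RE classifier: label embeddings are concatenated with BERT
# hidden states, projected into head/tail roles, and scored per relation class.
import torch
import torch.nn as nn

class BiaffineRE(nn.Module):
    def __init__(self, hidden_size, num_ner_labels, num_rel_labels,
                 label_embed_size=128, proj_size=256):
        super().__init__()
        self.label_embed = nn.Embedding(num_ner_labels, label_embed_size)
        in_size = hidden_size + label_embed_size
        # Separate projections for head and tail roles capture relation directionality.
        self.head_ffnn = nn.Sequential(nn.Linear(in_size, proj_size), nn.GELU())
        self.tail_ffnn = nn.Sequential(nn.Linear(in_size, proj_size), nn.GELU())
        # Biaffine scoring: a bilinear term plus a linear term over the concatenated pair.
        self.bilinear = nn.Parameter(torch.zeros(num_rel_labels, proj_size, proj_size))
        self.linear = nn.Linear(2 * proj_size, num_rel_labels)

    def forward(self, token_reprs, ner_label_ids, head_idx, tail_idx):
        # token_reprs: (batch, seq_len, hidden_size); ner_label_ids: (batch, seq_len)
        # head_idx / tail_idx: indices of the entities' last word tokens (single example).
        x = torch.cat([token_reprs, self.label_embed(ner_label_ids)], dim=-1)
        heads = self.head_ffnn(x[0, head_idx])  # (num_pairs, proj_size)
        tails = self.tail_ffnn(x[0, tail_idx])  # (num_pairs, proj_size)
        bilinear = torch.einsum("np,rpq,nq->nr", heads, self.bilinear, tails)
        return bilinear + self.linear(torch.cat([heads, tails], dim=-1))
```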
The model is implemented in PyTorch, utilizing the BERT-Base model from the PyTorch Transformers library. To accelerate training and reduce memory usage, NVIDIA's automatic mixed precision (AMP) library, Apex, is used.
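The paper itself uses NVIDIA's Apex; as an illustration of the same idea, the sketch below uses PyTorch's built-in `torch.cuda.amp` API, which has since largely superseded Apex for mixed-precision training. The `model`, `optimizer`, and `train_loader` objects are assumed to exist.

```python
# Sketch of mixed-precision training with torch.cuda.amp (stand-in for Apex).
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in train_loader:               # assumed training data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass in mixed precision
        ner_loss, re_loss = model(**batch)   # assumed joint model returning both losses
        loss = ner_loss + re_loss
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```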
Entity Pretraining Strategy
To address the issue of low-performance entity detection in the early stages of training, the paper introduces a modified entity pretraining strategy. Instead of delaying the training of the RE module, the contribution of the RE loss to the total loss is weighted during the first epoch of training. The weighting factor, λ, is increased linearly from 0 to 1 during the first epoch and set to 1 for the remaining epochs. This approach allows the NER module to quickly achieve good performance, which in turn benefits the RE module.
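A minimal sketch of this schedule is shown below, assuming access to a global step counter and the number of steps per epoch; the function name `re_loss_weight` is illustrative.

```python
# Modified entity pretraining: the RE loss is down-weighted by a factor lambda that
# ramps linearly from 0 to 1 over the first epoch, then stays at 1.
def re_loss_weight(global_step, steps_per_epoch):
    """Return lambda for the current training step."""
    if global_step >= steps_per_epoch:        # after the first epoch
        return 1.0
    return global_step / steps_per_epoch      # linear ramp during the first epoch

# Inside the training loop (names are illustrative):
# lam = re_loss_weight(step, steps_per_epoch)
# loss = ner_loss + lam * re_loss
```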
Experimental Evaluation and Results
The model is evaluated on five benchmark corpora across three domains: ACE04/05, CoNLL04, ADE, and i2b2. The results demonstrate that the model matches or exceeds state-of-the-art performance for joint NER and RE on most datasets. Specifically, the model achieves substantial improvements on the ACE04 (4.59%) and ADE (10.25%) corpora. On the i2b2 dataset, the model's performance is compared to independent NER and RE systems, as there are no published joint models for this dataset. The ablation studies reveal that the deep biaffine attention mechanism and the entity pretraining strategy are critical for achieving maximum performance.
Figure 2: Visualization of the attention weights from select layers and heads of BERT after it was fine-tuned within our model on the CoNLL04 corpus. Darker squares indicate larger attention weights. Attention weights are shown for the input sentence: "Ruby fatally shot Oswald two days after Kennedy was assassinated." The CLS and SEP tokens have been removed. Four major patterns are displayed: attention to the next word (first panel from the left), the previous word (second), the word itself (third), and the end of the sentence (fourth).
Analysis of Attention Weights
The inclusion of a transformer-based LLM allows the attention weights to be visualized and analyzed. Visualizing these weights after fine-tuning (Figure 2) shows that the model retains patterns observed in pre-trained BERT models, such as attending to the next and previous words, the word itself, and the end of the sentence.
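A sketch of how such attention weights can be extracted for inspection, assuming the Hugging Face `transformers` package; the layer and head indices are arbitrary and the plotting step is omitted.

```python
# Extract per-layer, per-head attention weights from a BERT model for visualization.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "Ruby fatally shot Oswald two days after Kennedy was assassinated."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer, each (batch, heads, seq, seq).
layer, head = 0, 0                             # arbitrary layer/head to inspect
weights = outputs.attentions[layer][0, head]   # (seq_len, seq_len) attention matrix
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# weights[i, j] is how much token i attends to token j; drop the [CLS]/[SEP] rows and
# columns and plot as a heatmap to reproduce patterns like next-/previous-word attention.
```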
Conclusion and Future Directions
The proposed end-to-end model offers several advantages over previous joint NER and RE models, including no reliance on hand-crafted features or external NLP tools, integration of a pre-trained LLM, and state-of-the-art performance across multiple datasets and domains. The model's modularity allows for easy adaptation to different domains and LLMs.
The paper identifies several avenues for future research, including extending the model to handle inter-sentence relations, nested entities, and multilingual corpora. The authors note that high performance on the ADE corpus may not transfer to real-world scenarios due to the corpus's simplicity.