DARE: Data Augmented Relation Extraction with GPT-2 (2004.13845v1)

Published 6 Apr 2020 in cs.CL, cs.LG, and stat.ML

Abstract: Real-world Relation Extraction (RE) tasks are challenging to deal with, either due to limited training data or class imbalance issues. In this work, we present Data Augmented Relation Extraction (DARE), a simple method to augment training data by properly fine-tuning GPT-2 to generate examples for specific relation types. The generated training data is then used in combination with the gold dataset to train a BERT-based RE classifier. In a series of experiments we show the advantages of our method, which leads to improvements of up to 11 F1 score points against a strong baseline. Also, DARE achieves new state of the art in three widely used biomedical RE datasets, surpassing the previous best results by 4.7 F1 points on average.

Citations (73)

Summary

  • The paper introduces DARE, which augments scarce training data using GPT-2 to improve relation extraction performance.
  • It employs a two-step strategy by fine-tuning GPT-2 per relation type to generate synthetic examples that bolster BERT-based classifier training.
  • Experiments on biomedical datasets show up to 11 F1 point improvements, setting new state-of-the-art benchmarks in relation extraction.

DARE: Data Augmented Relation Extraction with GPT-2

The paper introduces DARE (Data Augmented Relation Extraction), a novel approach for enhancing Relation Extraction (RE) through data augmentation with GPT-2. RE identifies semantic relationships between entities in text, but real-world RE tasks frequently suffer from training data scarcity or class imbalance. The proposed method mitigates these issues by leveraging GPT-2's ability to generate synthetic training examples for specific relation types; the augmented datasets are then used to train BERT-based RE classifiers.

Methodology

The authors employ a two-step strategy for data augmentation. First, they fine-tune a pre-trained GPT-2 model separately on the examples of each relation type in an RE dataset, so that each fine-tuned model generates new training samples specific to its relation type. The synthetic data are then combined with the original gold-standard data to train BERT-based RE classifiers. To cope with the noise typical of generated samples and to increase robustness, the paper trains an ensemble of classifiers, each on a different subset of the generated data combined with the gold-standard data.
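
Concretely, the per-relation generation step could look like the following minimal sketch, assuming a Hugging Face transformers setup. The placeholder sentences, the EOS-separated formatting, the epoch count, and the nucleus-sampling parameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: fine-tune GPT-2 on gold examples of one relation type, then sample
# synthetic training sentences. Assumes the `transformers` and `torch` packages.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Gold training sentences for a single relation type (hypothetical placeholders).
relation_examples = [
    "ENTITY1 inhibits the expression of ENTITY2 in hepatic cells.",
    "Co-administration of ENTITY1 increases plasma levels of ENTITY2.",
]

# Fine-tune: each example is a separate sequence terminated by EOS, so the
# model learns to emit one self-contained example per sample.
model.train()
for epoch in range(3):
    for text in relation_examples:
        ids = tokenizer(text + tokenizer.eos_token, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss  # standard causal LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Generate synthetic examples for this relation type, starting from the
# BOS token (GPT-2 uses the same token for BOS and EOS).
model.eval()
start = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    samples = model.generate(
        start,
        do_sample=True,
        top_p=0.9,          # nucleus sampling: an assumption, not the paper's setting
        max_length=64,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,
    )
synthetic = [tokenizer.decode(s, skip_special_tokens=True) for s in samples]
```

Repeating this procedure once per relation type yields a pool of synthetic examples per class, which can target exactly the under-represented relations in an imbalanced dataset.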

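The ensemble step might then be sketched as follows. Here `train_bert_classifier` is a hypothetical stand-in for fine-tuning a BERT-based RE classifier, and majority voting is used for illustration; the paper's exact combination scheme, ensemble size, and subset sizes may differ.

```python
# Sketch: train each ensemble member on the full gold set plus a different
# random subset of the synthetic data, mitigating noise in generated samples.
import random
from collections import Counter

def train_ensemble(gold, synthetic, train_bert_classifier,
                   n_members=5, subset_size=1000):
    members = []
    for _ in range(n_members):
        subset = random.sample(synthetic, min(subset_size, len(synthetic)))
        members.append(train_bert_classifier(gold + subset))
    return members

def predict(members, example):
    # Combine member predictions by majority vote.
    votes = [clf(example) for clf in members]  # each member returns a relation label
    return Counter(votes).most_common(1)[0][0]
```
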
Experimental Evaluation

The method was tested on three biomedical RE datasets: CDR, DDI2013, and ChemProt, which exhibit varying degrees of class imbalance and scarcity of positive samples. The experiments demonstrated substantial improvements in classification performance: DARE improved F1 scores by up to 11 points over strong baselines on the most heavily imbalanced datasets, and it achieved new state-of-the-art results across all three datasets, surpassing previous best results by an average of 4.7 F1 points.

Implications and Future Directions

The implications of this research are significant for the development and future use of RE systems, particularly in domains where labeled data is persistently scarce. By automating the generation of diverse training data without reliance on domain expertise or manually curated augmentations, DARE offers a scalable solution for enhancing text classification tasks. Its methodological contributions include refined techniques for relation-conditioned text generation and for integrating and balancing classifier ensembles.

Future work may apply similar data augmentation techniques to other natural language processing tasks, adjust the GPT-2 fine-tuning procedure, or explore alternative generator architectures. Expanding experiments beyond biomedical texts could validate the versatility of the approach, and further refinement of generated-data quality control or noise reduction strategies could broaden the applicability of synthetic data for training robust classifiers.

In summary, DARE's GPT-2-based data augmentation offers a promising enhancement to RE tasks, delivering substantial performance gains in scenarios plagued by class imbalance and limited data, and marking a notable advance in text data augmentation strategies.
