Training Question Answering Models From Synthetic Data (2002.09599v1)

Published 22 Feb 2020 in cs.CL and cs.AI

Abstract: Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of LLMs and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic corpus generated by an 8.3 billion parameter GPT-2 model. With no access to human supervision and only access to other models, we are able to train state of the art question answering networks on entirely model-generated data that achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain on EM score compared to prior work using synthetic data.

Citations (149)

Summary

  • The paper demonstrates that QA models trained solely on synthetic data can reach 88.4 EM and 93.9 F1 on SQuAD1.1, rivaling models trained on human-annotated data.
  • It introduces a three-step pipeline using BERT for answer extraction, a modified GPT-2 for question generation, and roundtrip consistency for effective filtration.
  • The study implies that high-quality synthetic data can lessen dependence on costly, limited human-labeled data, scaling QA model training across diverse domains.

Training Question Answering Models From Synthetic Data

This paper, by Raul Puri and colleagues, explores training Question Answering (QA) models on synthetic data and benchmarks them against models trained on human-generated datasets. The motivation is the cost and scarcity of labeled training data, a substantial obstacle to developing high-performing QA models. The authors posit that synthetic question-answer pairs generated by LLMs can narrow the gap between synthetic and human-generated data.

The paper reports training a QA model exclusively on synthetic data generated with an 8.3-billion-parameter GPT-2 model, reaching 88.4 EM and 93.9 F1 on the SQuAD1.1 dev set. Compared to a baseline trained only on the SQuAD1.1 training set questions, the synthetic-only approach matched and in some cases exceeded performance. On SQuAD2.0, the method yields a 2.8-point absolute gain in EM over prior work using synthetic data.
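For reference, EM and F1 here follow the standard SQuAD evaluation convention: EM is an exact string match after normalization, and F1 measures token overlap between prediction and reference. Below is a minimal sketch of these metrics using the conventional SQuAD normalization; this is not the paper's own evaluation code, and the official script additionally takes a maximum over multiple reference answers:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Normalization makes "The Eiffel Tower." match "eiffel tower" exactly.
assert exact_match("The Eiffel Tower.", "eiffel tower") == 1.0
```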

Key to this achievement is a three-step question generation pipeline (a combined code sketch follows the list):

  1. Answer Generation: A BERT-based span selection model extracts candidate answers from a given text corpus. Because it takes no question as input, it can propose a broad set of potential answers from each passage.
  2. Question Generation: A pretrained GPT-2 model, adapted for question generation, produces questions conditioned on the passage and each extracted answer. Both stronger pretraining and larger model sizes improve question quality.
  3. Question Filtration: Roundtrip consistency filters the generated pairs: a QA model answers each generated question, and a pair is kept only when the prediction matches the conditioning answer. Overgeneration, producing multiple candidate questions per answer, makes this filter more effective.
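To make the pipeline concrete, here is a minimal sketch of the overgenerate-and-filter loop. The callables `extract_answers`, `generate_questions`, and `answer_question` are hypothetical stand-ins for the paper's BERT span selector, answer-conditioned GPT-2, and filtration QA model, and the exact-string comparison is one simple consistency rule rather than necessarily the paper's exact criterion:

```python
from typing import Callable, List, Tuple

def synthesize_qa_pairs(
    context: str,
    extract_answers: Callable[[str], List[str]],          # step 1: span selector over the passage
    generate_questions: Callable[[str, str], List[str]],  # step 2: (context, answer) -> candidate questions
    answer_question: Callable[[str, str], str],           # step 3: (context, question) -> predicted answer
) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs that survive roundtrip-consistency filtering."""
    kept: List[Tuple[str, str]] = []
    for answer in extract_answers(context):
        # Overgeneration: sample several candidate questions per answer.
        for question in generate_questions(context, answer):
            # Roundtrip consistency: keep the pair only if a QA model, reading
            # the same context, recovers the conditioning answer.
            predicted = answer_question(context, question)
            if predicted.strip().lower() == answer.strip().lower():
                kept.append((question, answer))
    return kept
```

Softer matching criteria, such as keeping pairs whose predicted answer exceeds an F1 threshold against the conditioning answer, would fit the same loop.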

The implications of this work are significant: reliance on human-generated datasets could be reduced or even eliminated, allowing QA model training to scale with synthetic question-answer pairs. This is particularly valuable in domains where labeled data is scarce or costly to obtain.

Practically, the implications reach beyond the SQuAD datasets: the methodology, particularly the demonstrated utility of large transformer models, could extend to other QA settings, including open-domain, multi-hop, and conversational QA. Synthetic data generation also holds promise for other data-hungry NLP tasks, such as dialogue systems and information retrieval.

Finally, the paper points to future research on more advanced filtering and generation techniques, such as finer-grained control over answer types and further scaling of LLMs. Such directions could yield QA models that match or surpass those trained on exhaustively curated human data, opening new avenues in AI and NLP research.
