Training Question Answering Models From Synthetic Data (2002.09599v1)

Published 22 Feb 2020 in cs.CL and cs.AI

Abstract: Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of LLMs and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic corpus generated by an 8.3 billion parameter GPT-2 model. With no access to human supervision and only access to other models, we are able to train state of the art question answering networks on entirely model-generated data that achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain on EM score compared to prior work using synthetic data.

Citations (149)

Summary

  • The paper demonstrates that QA models trained solely on synthetic data can reach 88.4 EM and 93.9 F1 on SQuAD1.1, rivaling models trained on human-annotated data.
  • It introduces a three-step pipeline using BERT for answer extraction, a modified GPT-2 for question generation, and roundtrip consistency for effective filtration.
  • The study implies that high-quality synthetic data can lessen dependence on costly, limited human-labeled data, scaling QA model training across diverse domains.

Training Question Answering Models From Synthetic Data

This paper, by Raul Puri and colleagues, explores training Question Answering (QA) models on synthetic data and benchmarks them against models trained on human-generated datasets. The motivation is the cost and scarcity of labeled training data, a substantial obstacle to developing high-performing QA models. The authors posit that synthetic question-answer pairs generated by LLMs can narrow the gap between synthetic and human-generated data.

The paper reports training a QA model exclusively on synthetic data generated with an 8.3-billion-parameter GPT-2 model, reaching 88.4 EM and 93.9 F1 on the SQuAD1.1 dev set. Compared to a baseline trained only on the SQuAD1.1 training set questions, the synthetic-only approach matched and in some cases exceeded performance. On SQuAD2.0, the method yields a 2.8-point absolute gain in EM over prior work using synthetic data.
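For reference, EM and F1 here follow the standard SQuAD evaluation convention: EM is an exact string match after normalization, and F1 measures token overlap between prediction and reference. Below is a minimal sketch of these metrics using the conventional SQuAD normalization; this is not the paper's own evaluation code, and the official script additionally takes a maximum over multiple reference answers:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Normalization makes "The Eiffel Tower." match "eiffel tower" exactly.
assert exact_match("The Eiffel Tower.", "eiffel tower") == 1.0
```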

Key to this achievement is a three-step question generation pipeline (a combined code sketch follows the list):

  1. Answer Generation: A BERT-based span selection model extracts candidate answers from a given text corpus. Because it takes no question as input, it can propose a broad set of potential answers from each passage.
  2. Question Generation: A pretrained GPT-2 model, adapted for question generation, produces questions conditioned on the passage and each extracted answer. Both stronger pretraining and larger model sizes improve question quality.
  3. Question Filtration: Roundtrip consistency filters the generated pairs: a QA model answers each generated question, and a pair is kept only when the prediction matches the conditioning answer. Overgeneration, producing multiple candidate questions per answer, makes this filter more effective.
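To make the pipeline concrete, here is a minimal sketch of the overgenerate-and-filter loop. The callables `extract_answers`, `generate_questions`, and `answer_question` are hypothetical stand-ins for the paper's BERT span selector, answer-conditioned GPT-2, and filtration QA model, and the exact-string comparison is one simple consistency rule rather than necessarily the paper's exact criterion:

```python
from typing import Callable, List, Tuple

def synthesize_qa_pairs(
    context: str,
    extract_answers: Callable[[str], List[str]],          # step 1: span selector over the passage
    generate_questions: Callable[[str, str], List[str]],  # step 2: (context, answer) -> candidate questions
    answer_question: Callable[[str, str], str],           # step 3: (context, question) -> predicted answer
) -> List[Tuple[str, str]]:
    """Return (question, answer) pairs that survive roundtrip-consistency filtering."""
    kept: List[Tuple[str, str]] = []
    for answer in extract_answers(context):
        # Overgeneration: sample several candidate questions per answer.
        for question in generate_questions(context, answer):
            # Roundtrip consistency: keep the pair only if a QA model, reading
            # the same context, recovers the conditioning answer.
            predicted = answer_question(context, question)
            if predicted.strip().lower() == answer.strip().lower():
                kept.append((question, answer))
    return kept
```

Softer matching criteria, such as keeping pairs whose predicted answer exceeds an F1 threshold against the conditioning answer, would fit the same loop.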

The implications of this work are significant: reliance on human-generated datasets could be reduced or even eliminated, allowing QA model training to scale with synthetic question-answer pairs. This is particularly valuable in domains where labeled data is scarce or costly to obtain.

Practically, the implications reach beyond the SQuAD datasets: the methodology, particularly the demonstrated utility of large transformer models, could extend to other QA settings, including open-domain, multi-hop, and conversational QA. Synthetic data generation also holds promise for other data-hungry NLP tasks, such as dialogue systems and information retrieval.

Finally, the paper points to future research on more advanced filtering and generation techniques, such as finer-grained control over answer types and further scaling of LLMs. Such directions could yield QA models that match or surpass those trained on exhaustively curated human data, opening new avenues in AI and NLP research.
