Prompting-based Synthetic Data Generation for Few-Shot Question Answering (2405.09335v1)

Published 15 May 2024 in cs.CL

Abstract: Although LMs have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using LLMs can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that LLMs contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.

Summary

The paper presents a two-step pipeline using answer sampling with NER and prompt-based question generation to create synthetic training data for few-shot QA.
It uses T5 v1.1 for question generation and applies rule-based and consistency filtering to keep the generated questions relevant, outperforming previous few-shot QA approaches.
Experiments on the Few-Shot MRQA benchmark, including SQuAD and TextbookQA, show gains over prior approaches, and a user study on NewsQA found that data generated with 128 samples was comparable in quality to human-annotated data.

Prompting-Based Synthetic Data Generation for Few-Shot Question Answering

Introduction

The paper "Prompting-based Synthetic Data Generation for Few-Shot Question Answering" (2405.09335) addresses the challenge of enhancing Question Answering (QA) performance in scenarios with limited labeled data. It leverages LLMs to generate synthetic domain-specific data, thus reducing the need for extensive data annotation. The paper focuses on extractive QA, wherein the answer is located as a span within a given context.

The authors propose a method that utilizes prompt-based data generation, arguing that pre-trained LLMs contain valuable task-agnostic and domain-agnostic knowledge that can be harnessed to improve few-shot QA models. This approach is particularly beneficial in low-resource settings, where the annotation process is resource-intensive.

Figure 1: Comparison of a) common approaches, e.g., Prompting, for MRQA and b) our approach adding synthetic task- and domain-specific data without the need of additional labeled data.

Methodology

The proposed methodology comprises two primary steps: Answer Sampling and Question Generation.

Answer Sampling

The paper utilizes Named Entity Recognition (NER) to sample potential answer spans from the context. This technique is deemed efficient as it does not require extensive domain-specific knowledge or labeled data, making it applicable across diverse domains.
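
The answer sampling step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_tagger` is a hypothetical regex heuristic standing in for a real NER model, and `Span` and `sample_answer_candidates` are names invented here. What it shows is the constraint that matters for extractive QA: every candidate must occur verbatim in the context.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    text: str
    start: int  # character offset into the context


def toy_tagger(context: str) -> List[Span]:
    """Hypothetical stand-in for an NER model: capitalized word runs and numbers."""
    pattern = re.compile(r"\b(?:[A-Z][a-z]+(?: [A-Z][a-z]+)*|\d+(?:\.\d+)?)\b")
    return [Span(m.group(), m.start()) for m in pattern.finditer(context)]


def sample_answer_candidates(
    context: str,
    tagger: Callable[[str], List[Span]] = toy_tagger,
) -> List[Span]:
    """Return de-duplicated candidate answers that are exact spans of the context."""
    seen = set()
    candidates = []
    for span in tagger(context):
        # Extractive QA needs the answer to be recoverable as a span.
        if context[span.start:span.start + len(span.text)] != span.text:
            continue
        key = span.text.lower()
        if key in seen:
            continue
        seen.add(key)
        candidates.append(span)
    return candidates


if __name__ == "__main__":
    for cand in sample_answer_candidates("Marie Curie won the Nobel Prize in 1903."):
        print(cand.start, cand.text)
```

In practice the `tagger` argument would wrap an off-the-shelf NER model; the regex only keeps the sketch dependency-free.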

Question Generation

The second step prompts the LLM to generate a question conditioned on the sampled answer and the context. The generator is T5 v1.1, an encoder-decoder model, so the output is conditioned on the entire input sequence rather than only on the preceding tokens. A prompt template combines the context and the sampled answer, and the model predicts the question.
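
A prompt of this kind can be assembled with plain string formatting. The sketch below is an assumption for illustration: the exact template wording used in the paper is not given here, `build_prompt` and `max_context_chars` are hypothetical names, and `<extra_id_0>` is the first sentinel token in the T5 vocabulary, used here as the slot the model is asked to fill with the question.

```python
def build_prompt(
    context: str,
    answer: str,
    mask_token: str = "<extra_id_0>",
    max_context_chars: int = 1000,
) -> str:
    """Combine context and sampled answer into one input; the question is the masked slot."""
    start = context.find(answer)
    if start < 0:
        raise ValueError("answer must be a span of the context")
    if len(context) > max_context_chars:
        # Keep a window around the answer so it stays visible to the model.
        half = max(0, (max_context_chars - len(answer)) // 2)
        lo = max(0, start - half)
        context = context[lo:lo + max_context_chars]
    # Hypothetical template layout, not the paper's exact wording.
    return f"context: {context} answer: {answer} question: {mask_token}"


if __name__ == "__main__":
    print(build_prompt("Marie Curie won the Nobel Prize in 1903.", "1903"))
```

The resulting string would be tokenized and passed to the encoder, with the decoder generating the text for the masked slot.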

The generation process uses soft tokens (trainable prompt embeddings) initialized from pre-trained word embeddings, followed by two filtering steps that keep the generated questions relevant. Rule-based filtering discards nonsensical outputs, while consistency filtering uses a pre-trained MRQA model to check that the answer it predicts for a generated question agrees with the sampled answer.
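
The two filters can be sketched as below. The specific rules in `passes_rules` (question mark, minimum length, no answer leakage) are assumptions chosen for illustration, since the concrete rules are not listed here, and `qa_model` is a hypothetical callable standing in for the pre-trained MRQA model.

```python
import re
import string
from typing import Callable, Dict, List

Sample = Dict[str, str]  # keys: "context", "question", "answer"


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def passes_rules(question: str, answer: str) -> bool:
    """Illustrative rule-based checks; the actual rules are an assumption."""
    q = question.strip()
    if not q.endswith("?"):
        return False
    if len(q.split()) < 3:
        return False
    # Reject questions that leak the answer (also rejects empty answers).
    if normalize(answer) in normalize(q):
        return False
    return True


def filter_samples(
    samples: List[Sample],
    qa_model: Callable[[str, str], str],
) -> List[Sample]:
    """Keep samples that pass the rules and whose answer the QA model reproduces."""
    kept = []
    for s in samples:
        if not passes_rules(s["question"], s["answer"]):
            continue
        predicted = qa_model(s["context"], s["question"])
        if normalize(predicted) == normalize(s["answer"]):
            kept.append(s)
    return kept
```

Exact match after normalization is the strictest form of the consistency check; a softer variant could accept predictions that overlap the sampled answer above some threshold.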

Figure 2: An example of our data generation pipeline: We first sample answer candidates (using NER) and then prompt a PLM to generate a question conditioned on context and answer (1). The generated question-answer pair is then used with the initial context to train an MRQA model (2). We afterwards perform additional training on labeled data if available.

Experimental Setup and Results

The approach is evaluated on the Few-Shot MRQA benchmark, which covers multiple datasets including SQuAD. It outperforms existing state-of-the-art approaches, and models trained on the synthetic data generated through prompting achieve high F1 scores, in some cases (such as TextbookQA) even surpassing the full-data setting.
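
For reference, F1 in extractive QA is commonly computed as token overlap between the predicted and the gold answer, following the SQuAD evaluation convention. The sketch below assumes that convention applies here; `token_f1` is a name chosen for this example.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "the Nobel Prize" against the gold answer "Nobel Prize in Physics" has precision 1 and recall 0.5, giving an F1 of 2/3.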

Figure 3: MRQA performance (F1) as a function of dataset sizes for the best performing approaches on the mean of all datasets in the few-shot MRQA benchmark.

Analysis

A user study evaluated the quality of the generated question-answer pairs and found that data generated with 128 samples was comparable in quality to human-annotated data. This suggests that annotation effort can be reduced without sacrificing data quality.

Figure 4: For the NewsQA dataset, 100 question-answer pairs were quality-assessed by humans in each setting (generated data taking 16 and 128 samples into account as well as labeled (gold) data).

Conclusion

The paper makes the case for using LLMs to generate training data for QA tasks and reports consistent improvements in few-shot scenarios. The approach narrows the performance gap between models trained on extensive labeled datasets and few-shot models, suggesting that pre-trained LLMs are a practical source of high-quality synthetic data.

Future Directions

Future research can explore leveraging LLMs for both question and answer generation, addressing the complexity of extractive QA. Additionally, the integration of feedback mechanisms and in-context learning approaches could further enhance model performance in low-resource settings.

Ethical Considerations

The user study was conducted with attention to ethical considerations, ensuring participant consent, privacy, and fair compensation through a standardized platform.

This approach presents a viable path forward in reducing dependence on costly manual annotation across diverse application domains in question answering systems.