- The paper proposes a novel approach to reducing hallucinations by verifying LLM-generated answers against retrieved supporting evidence.
- It uses both sparse (BM25) and dense retrieval methods, and reports 80% accuracy when the LLM verifies its own answers in an open-domain setting.
- The findings highlight limitations in retrieval efficiency and prompt design, suggesting further research is needed to enhance LLM reliability.
Retrieving Supporting Evidence for Generative Question Answering
The paper "Retrieving Supporting Evidence for Generative Question Answering" (2309.11392) addresses a significant challenge in the advancement of LLMs - their tendency to produce hallucinated answers which can appear convincing even when incorrect. This research explores whether LLMs can self-verify their generated answers against an external corpus using retrieval methods. Specifically, it examines the degree to which LLMs hallucinate answers in an open-domain setting and proposes methodologies for automatic answer validation using sparse and dense retrieval pipelines.
Introduction and Background
Recent improvements in NLP via transformer-based LLMs, such as BERT and GPT-3, have significantly advanced text generation tasks, including question answering. Despite these advancements, LLMs are prone to generating convincing yet inaccurate information, known as hallucinations. This paper tackles the critical issue of hallucination, especially in sensitive domains like healthcare, by validating LLM-generated answers against external sources.
The authors utilize information retrieval (IR) approaches, which rapidly locate relevant documents from large corpora, to address hallucinations. The retrieval-augmented generation approach conditions text generation on retrieved documents, but this method still suffers from hallucinations. This research explores retrieval after generation, where the LLM verifies its output against supporting evidence.
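Conceptually, retrieval after generation is a three-step loop: generate an answer without grounding, retrieve passages using that answer, and ask the model whether the retrieved evidence supports it. The following is a minimal sketch of that loop, not the authors' implementation; the `generate`, `retrieve`, and `verify` callables are hypothetical stand-ins for an LLM call, a retrieval pipeline, and a verification prompt.

```python
from typing import Callable, List, Tuple

def self_verify(
    question: str,
    generate: Callable[[str], str],          # hypothetical LLM answer generator
    retrieve: Callable[[str, int], List[str]],  # hypothetical corpus retriever
    verify: Callable[[str, str, List[str]], bool],  # hypothetical LLM-based judge
    k: int = 5,
) -> Tuple[str, bool]:
    """Generate an answer, then judge it against retrieved evidence."""
    answer = generate(question)                      # 1. LLM answers with no grounding
    passages = retrieve(question + " " + answer, k)  # 2. query the corpus with question + answer
    supported = verify(question, answer, passages)   # 3. LLM checks support in the passages
    return answer, supported
```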
Methodology
The paper describes two experiments conducted on the MS MARCO (V1) test collection, using its questions and passages for validation. The first experiment assesses each generated answer as a whole: the question and the answer are combined into a single query and issued against the corpus using both sparse and dense retrieval. The setup uses two retrieval methods, the Okapi BM25 ranking function for sparse retrieval and a neural pipeline for dense retrieval. The LLM is then prompted to compare its generated answer with the retrieved passages.
Figure 1: Self-detecting hallucination in LLMs.
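For the sparse side of this pipeline, the paper names Okapi BM25 as the ranking function. The sketch below is a simplified stand-in, assuming the third-party `rank_bm25` package and a toy three-passage corpus in place of MS MARCO; the authors' actual retrieval stack may differ.

```python
from rank_bm25 import BM25Okapi  # pip install rank_bm25

# Toy passage corpus standing in for the MS MARCO (V1) passage collection.
corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "BM25 is a bag-of-words ranking function used in sparse retrieval.",
    "Paris is the capital of France.",
]
tokenized_corpus = [passage.lower().split() for passage in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# Experiment 1 queries the corpus with the question and the generated answer
# concatenated, so that passages supporting (or contradicting) the answer surface.
question = "When was the Eiffel Tower completed?"
generated_answer = "The Eiffel Tower was completed in 1889."
query_tokens = (question + " " + generated_answer).lower().split()

top_passages = bm25.get_top_n(query_tokens, corpus, n=2)
for passage in top_passages:
    print(passage)  # passages shown to the LLM alongside its own answer
```

A dense pipeline would replace the BM25 scoring above with embedding similarity between query and passage vectors, but the query construction (question concatenated with the generated answer) stays the same.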
The second experiment evaluates generated answers at a more granular level: each answer is decomposed into atomic factual statements, and each statement is verified individually against retrieved evidence.
Figure 2: Overview of fact-based self-detecting hallucination in LLMs.
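A rough sketch of this fact-level verification is shown below, assuming a generic `llm` callable (prompt in, text out) and a `retrieve` function like the BM25 example above; the decomposition and verification prompts are illustrative, not the paper's exact wording.

```python
from typing import Callable, Dict, List

def decompose(answer: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM to split an answer into atomic factual statements."""
    prompt = (
        "List each atomic factual claim in the following answer, one per line:\n"
        f"{answer}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def verify_facts(
    question: str,
    answer: str,
    llm: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    k: int = 3,
) -> Dict[str, bool]:
    """Check each extracted fact against its own retrieved evidence."""
    results = {}
    for fact in decompose(answer, llm):
        passages = retrieve(question + " " + fact, k)  # evidence for this fact only
        prompt = (
            "Passages:\n" + "\n".join(passages)
            + "\n\nDoes the evidence above support this statement? Answer yes or no.\n"
            + fact
        )
        results[fact] = llm(prompt).strip().lower().startswith("yes")
    return results
```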
Results
The paper reports that LLMs verify their own answers with 80% accuracy when given supporting passages. However, verification reduces hallucination but does not eliminate it. The experiments demonstrate the efficacy of combining generation with post-generation verification, suggesting that LLMs can self-detect hallucinations by leveraging retrieval systems.
Figure 3 illustrates the stepped classification applied to each question-answer pair during validation.
Figure 3: Stepped classification of a question-answer pair.
Furthermore, discrepancies uncovered during manual labeling revealed that retrieval failures and prompt limitations can cause hallucinations to be missed, motivating further research into improved methods for self-verification.
Conclusion
This research identifies and tests methodologies for verifying LLM-generated answers, marking a step towards reducing hallucination in generative question answering. While the paper's proposed techniques effectively reduce hallucinations, they do not entirely eliminate them. Future work could involve more sophisticated prompt engineering, fine-tuning LLMs, and experimenting with various retrieval systems.
The implications of this paper extend to enhancing the reliability of LLMs in critical applications. Continued research in this area is vital to developing high-assurance generative models that consistently produce accurate and trustworthy outputs.
Figure 4: Example of fact-based self-detecting hallucination in LLMs.