SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Published 15 Mar 2023 in cs.CL | (2303.08896v3)

Abstract: Generative LLMs such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose "SelfCheckGPT", a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database. SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages. We demonstrate that SelfCheckGPT can: i) detect non-factual and factual sentences; and ii) rank passages in terms of factuality. We compare our approach to several baselines and show that our approach has considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.


Authors (3)

Citations (309)


Summary

The paper presents SelfCheckGPT as a novel method that detects hallucinated outputs in generative LLMs using a zero-resource, black-box strategy.
It implements various techniques including BERTScore, QA, n-gram, NLI, and prompt-based evaluations to measure output consistency.
Experimental results show that the prompt and NLI variants achieve superior AUC-PR scores and strong correlations with human factuality judgments.

Analysis of SelfCheckGPT for Hallucination Detection in LLMs

The paper under review introduces "SelfCheckGPT," an approach for identifying hallucinated outputs from LLMs that are accessible only as black boxes. It addresses a problem common to generative LLMs such as GPT-3: they produce fluent yet factually incorrect content, known as hallucinations. Unlike existing fact-checking approaches, SelfCheckGPT operates in a zero-resource, black-box setting, so it needs neither an external database nor access to the model's output probability distributions. Instead, it assesses the factuality of a response by measuring how consistent it is with multiple stochastically sampled responses to the same prompt.
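The consistency idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the helper names are hypothetical, the sentence splitter is naive, and a token-overlap (Jaccard) similarity stands in for a learned metric. Each sentence of the response is compared with its best-matching sentence in every sample, and a low average match yields a high inconsistency score.

```python
import re

def tokens(text):
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for a learned metric."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def split_sentences(passage):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

def inconsistency_scores(response, samples, similarity=jaccard):
    """Return one score per response sentence; higher means less supported."""
    if not samples:
        raise ValueError("at least one sampled response is required")
    scores = []
    for sentence in split_sentences(response):
        # Best-matching sentence in each sample, then average over samples.
        best = [
            max((similarity(sentence, s) for s in split_sentences(sample)), default=0.0)
            for sample in samples
        ]
        scores.append(1.0 - sum(best) / len(best))
    return scores

if __name__ == "__main__":
    response = "Paris is the capital of France. Paris has ten million moons."
    samples = ["Paris is the capital of France.", "The capital of France is Paris."]
    for sentence, score in zip(split_sentences(response), inconsistency_scores(response, samples)):
        print(f"{score:.2f}  {sentence}")
```

The `similarity` argument is deliberately pluggable: swapping in a stronger sentence-similarity function changes the quality of the signal but not the shape of the computation.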

Key Contributions

The paper's main contribution is a sampling-based hallucination detector with five variants, which differ in how they measure consistency between the response and the samples:

SelfCheckGPT with BERTScore: Scores each sentence of the response by its BERTScore similarity to the most similar sentence in the sampled outputs.
SelfCheckGPT with Question Answering (QA): Generates questions from the main response and answers them using the sampled passages; disagreement among the answers signals an inconsistency.
SelfCheckGPT with n-gram Models: Builds an n-gram model from the sampled texts to estimate token probabilities for the response; tokens with low estimated probability are treated as potential hallucinations.
SelfCheckGPT with NLI: Uses a Natural Language Inference model to test whether the sampled passages contradict sentences of the original response.
SelfCheckGPT with Prompt: Directly prompts an LLM to judge whether a sentence of the response is consistent with the sampled passages, relying on the model's own language understanding to separate factual from hallucinated statements.
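Of the variants listed above, the n-gram one is the easiest to sketch without external models. The following is a minimal unigram (n = 1) illustration under stated assumptions: the function names, the regex tokenizer, and the add-`alpha` smoothing are choices made here for illustration, and the response is counted together with the samples so that no response token has zero count. It reports the average and maximum negative log-probability per sentence, where higher values suggest less support in the samples.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_scores(response_sentences, samples, alpha=1.0):
    """Score each sentence with a smoothed unigram model built from the samples.

    Assumption: counts come from the samples plus the response itself,
    with add-alpha smoothing over the observed vocabulary.
    """
    counts = Counter()
    for text in list(samples) + list(response_sentences):
        counts.update(words(text))
    total = sum(counts.values())
    vocab_size = len(counts)

    def neg_log_prob(token):
        prob = (counts[token] + alpha) / (total + alpha * vocab_size)
        return -math.log(prob)

    results = []
    for sentence in response_sentences:
        nll = [neg_log_prob(t) for t in words(sentence)]
        results.append({
            "avg": sum(nll) / len(nll) if nll else 0.0,  # mean token surprise
            "max": max(nll, default=0.0),                # most surprising token
        })
    return results

if __name__ == "__main__":
    sentences = ["Paris is the capital of France.", "Paris has ten million moons."]
    samples = [
        "Paris is the capital of France.",
        "The capital of France is Paris.",
        "France has Paris as its capital.",
    ]
    for sentence, score in zip(sentences, unigram_scores(sentences, samples)):
        print(f"avg={score['avg']:.3f} max={score['max']:.3f}  {sentence}")
```

Reporting both the average and the maximum is a design choice for this sketch: a single very unlikely token can mark a hallucinated detail even when the rest of the sentence is well supported.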

Experimental Results

The study evaluates SelfCheckGPT against baseline techniques on a dataset of GPT-3-generated passages about individuals from WikiBio, manually annotated for factuality. SelfCheckGPT, in particular the Prompt and NLI variants, detects hallucinations better than the grey-box and other baseline methods. It also ranks passages by factuality effectively, achieving high Pearson and Spearman correlations with human judgments.
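To make the passage-level evaluation concrete, here is a small dependency-free sketch of the two correlation measures. It assumes, for illustration only, that a passage score is the mean of its sentence scores; the function names and the example numbers are hypothetical and are not results from the paper.

```python
import math

def passage_score(sentence_scores):
    """Assumed aggregation: mean of the sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

if __name__ == "__main__":
    # Made-up passage scores from a detector and from human annotation.
    detector = [0.10, 0.40, 0.35, 0.80]
    human = [0.00, 0.50, 0.30, 1.00]
    print("Pearson:", round(pearson(detector, human), 4))
    print("Spearman:", round(spearman(detector, human), 4))
```

Pearson rewards a linear relationship between detector scores and human scores, while Spearman only asks that the two orderings of passages agree, which is why both are useful when the goal is ranking.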

These gains appear as considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality ranking. The prompt-based variant, while computationally intensive, is the most accurate detector, which suggests that an LLM's own capabilities can be used for introspective evaluation.
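For readers unfamiliar with AUC-PR, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve. The function name, the label convention (1 marks the class being detected, for example non-factual sentences), and the example values are assumptions for illustration; tied scores are ordered arbitrarily here, which a production implementation would handle more carefully.

```python
def average_precision(labels, scores):
    """Average precision for binary labels and real-valued detector scores.

    labels: 1 for the class to detect, 0 otherwise.
    scores: higher means the detector considers the item more likely positive.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for label in labels if label)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank items from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / n_pos

if __name__ == "__main__":
    # Made-up example: two non-factual sentences among four.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print("Average precision:", round(average_precision(labels, scores), 4))
```

A detector that ranks every positive item above every negative one reaches 1.0, while a detector that scores at random tends toward the fraction of positives in the data, so the metric is most informative when compared against that base rate.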

Implications and Future Work

The paper has both practical and theoretical implications. Practically, the findings could make AI systems more reliable by reducing the misinformation that results from hallucinated outputs. Theoretically, self-sampling and consistency checking could serve as a foundation for unsupervised evaluation techniques for other generative models.

Future work could refine SelfCheckGPT to work with fewer samples or to use more computationally efficient LLMs. Evaluating it on more diverse datasets and model architectures would also show how well these hallucination detection techniques generalize.

Conclusion

This work positions "SelfCheckGPT" as a step toward mitigating the unreliability of AI-generated content caused by hallucinations. Because it treats the model as a black box, it applies to real-world settings where models are often accessible only through limited APIs. As generative technologies mature, approaches like SelfCheckGPT can help align output quality with user expectations of factual accuracy.
