
Chain-of-Verification Reduces Hallucination in Large Language Models

(2309.11495)
Published Sep 20, 2023 in cs.CL and cs.AI

Abstract

Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in LLMs. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata to closed-book MultiSpanQA and longform text generation.

The Chain-of-Verification method has the model fact-check its own draft answers, correcting inaccuracies in large language model responses.

Overview

  • The paper introduces Chain-of-Verification (CoVe), a multi-step process designed to reduce hallucinations in LLMs by enabling them to self-verify and refine their responses.

  • The CoVe method includes generating a baseline response, planning verification questions, executing these verifications independently, and then producing a final verified response, demonstrating significant improvements across several well-defined tasks.

  • Experimental results show substantial precision and factual accuracy improvements in tasks like list-based questions, closed-book question answering, and longform text generation, indicating the method's effectiveness in mitigating hallucinations.

Chain-of-Verification Reduces Hallucination in LLMs

In the paper "Chain-of-Verification Reduces Hallucination in Large Language Models," Dhuliawala et al. address a significant challenge in LLMs: the generation of plausible yet incorrect factual information, known as hallucination. Hallucinations remain a persistent issue, especially for less common facts (those in the torso and tail of the fact distribution) and in tasks requiring longform text generation. The authors propose Chain-of-Verification (CoVe), a deliberation-based method to tackle this problem.

Hallucinations in LLMs degrade the quality and reliability of generated content. Prior attempts to reduce hallucination include training-time corrections, generation-time corrections, and tool-augmentation methods. These approaches have limitations, however, and simply scaling up the model or its training data has not been sufficient on its own. A recent line of research integrates advanced reasoning capabilities into LLMs, such as chain-of-thought (CoT) prompting and self-critique mechanisms, and points in a promising direction. CoVe follows this trend by building a verification step into the response generation process to mitigate hallucinations.

Methodology

Chain-of-Verification (CoVe) introduces a multi-step process enabling LLMs to self-verify and refine their responses:

  1. Generate Baseline Response: The LLM first produces its initial response to a given query.
  2. Plan Verifications: The model then generates verification questions aimed at checking the factual accuracy of the original response.
  3. Execute Verifications: These verification questions are answered independently, ensuring the responses are not influenced by the initial answer.
  4. Generate Final Verified Response: The model uses the verification results to construct a revised, and ideally more accurate, final response.
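
To make the four steps concrete, here is a minimal sketch of how the loop might be wired together. The `llm` callable, the prompt wording, and the one-question-per-line planning format are illustrative assumptions rather than the authors' exact prompts.

```python
from typing import Callable

# Minimal sketch of the CoVe loop. `llm` is any callable that sends a prompt
# to a language model and returns its text reply; the prompts below are
# illustrative, not the paper's exact wording.

def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    # (1) Generate a baseline response to the user query.
    baseline = llm(f"Answer the following question.\n{query}")

    # (2) Plan verification questions that fact-check claims in the baseline.
    plan = llm(
        "Write one fact-checking question per line for the claims in this answer.\n"
        f"Question: {query}\nAnswer: {baseline}"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # (3) Execute verifications independently (factored execution): each
    # question is answered in a fresh prompt so the baseline cannot bias it.
    verifications = [(q, llm(q)) for q in questions]

    # (4) Generate the final verified response, conditioned on the evidence.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    return llm(
        f"Original question: {query}\n"
        f"Draft answer: {baseline}\n"
        f"Verification results:\n{evidence}\n"
        "Rewrite the draft answer so it is consistent with the verification results."
    )
```

This sketch mirrors the factored variant, in which each verification question is answered in its own prompt; the paper's two-step variant instead answers all planned questions together in a single second prompt.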

The paper evaluates several CoVe variants (joint, two-step, factored, and factor+revise) that differ in how the verification steps are split across prompts; factor+revise adds an explicit cross-check of the verification answers against the original draft. The method is tested across several well-defined tasks:

  • List-Based Questions: Tasks built from Wikidata and from Wiki-Category lists (drawn from the QUEST dataset), where precision is measured against the benchmark's gold entities.
  • MultiSpanQA: A question-answering benchmark with multiple independent answer spans, used here in a closed-book setting.
  • Longform Text Generation: Biography generation is evaluated using FactScore, a metric designed to measure factual accuracy.
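
As a rough illustration of how the list-based tasks are scored, the sketch below computes the precision of a predicted entity list against a gold list. The lowercase/strip normalization is an assumption made for illustration; the benchmarks' exact matching rules are not reproduced here.

```python
def list_precision(predicted: list[str], gold: list[str]) -> float:
    """Fraction of predicted entities that appear in the gold entity list."""
    gold_set = {g.strip().lower() for g in gold}  # illustrative normalization
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted if p.strip().lower() in gold_set)
    return hits / len(predicted)

# Two of three predicted entities appear in the gold list -> precision ~0.67.
print(list_precision(["Entity A", "Entity B", "Entity D"],
                     ["Entity A", "Entity B", "Entity C"]))
```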

Experimental Results

List-Based Tasks:

Wikidata and Wiki-Category List Tasks:

  • The paper reports substantial precision improvements. For instance, precision on the Wikidata task increases from 0.17 with the few-shot Llama 65B baseline to 0.36 with CoVe (two-step).
  • Results indicate a marked reduction in hallucinated answers while maintaining or slightly reducing the count of correct answers.

Question Answering (MultiSpanQA):

  • F1 improves from 0.39 (few-shot Llama 65B) to 0.48 (CoVe factored), reflecting gains in both precision and recall.
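
For reference, a minimal exact-match span F1, assuming predictions and gold answers are compared as sets of strings; MultiSpanQA's official scorer additionally awards partial credit for overlapping spans, which this sketch omits.

```python
def span_f1(predicted: set[str], gold: set[str]) -> float:
    """Exact-match span F1: harmonic mean of span precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```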

Longform Generation (Biographies):

  • FactScore improves significantly, from 55.9 (few-shot Llama 65B) to 71.4 (CoVe factor+revise), comparing favorably with results from InstructGPT, ChatGPT, and PerplexityAI.
  • A breakdown by fact rarity indicates that rare facts benefit most from CoVe, consistent with the hypothesis that LLMs answer shortform verification questions more accurately than they generate longform responses directly.
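
As a reminder of what the FactScore numbers measure, the sketch below shows only the aggregation step: a longform output is decomposed into atomic facts and the score is the fraction judged supported by a knowledge source, scaled to 0-100 as reported above. The fact extraction and the `is_supported` judge (for example, retrieval over Wikipedia) are assumed external components and are not implemented here.

```python
from typing import Callable

def factscore(atomic_facts: list[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts judged supported, scaled to 0-100.

    Extracting atomic facts from the generated biography and judging
    supportedness against a knowledge source are assumed external steps.
    """
    if not atomic_facts:
        return 0.0
    return 100.0 * sum(map(is_supported, atomic_facts)) / len(atomic_facts)
```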

Implications and Future Work

The CoVe methodology underscores the potential of leveraging internal deliberation mechanisms to enhance the factual accuracy of LLMs without relying on external datasets or retrieval mechanisms. The consistent improvement across various tasks suggests that CoVe is a viable complement to other verification strategies for reducing hallucinations.

Potential areas for future exploration include:

  • Integrating CoVe with external tools like retrieval-augmented generation to handle cases where internal verification alone falls short.
  • Expanding the verification framework beyond factual assertions to cover logical reasoning and opinion-based content.
  • Further refining verification-question generation to balance thorough questioning against computational cost.

CoVe positions itself as a significant methodological advancement in the field of language model hallucination mitigation, showcasing the power of structured internal verification processes.

Given these findings, CoVe offers a valuable addition to the suite of techniques aimed at improving the reliability and accuracy of outputs from LLMs. Future work can build upon this foundation to create even more robust AI systems capable of generating highly accurate and trustworthy content across a wide array of tasks.
