HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

Published 19 May 2023 in cs.CL | (2305.11747v3)

Abstract: LLMs, such as ChatGPT, are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge. To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for LLMs (HaluEval), a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, i.e., sampling-then-filtering. Besides, we also hire some human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content in specific topics by fabricating unverifiable information (i.e., about $19.5\%$ responses). Moreover, existing LLMs face great challenges in recognizing the hallucinations in texts. However, our experiments also prove that providing external knowledge or adding reasoning steps can help LLMs recognize hallucinations. Our benchmark can be accessed at https://github.com/RUCAIBox/HaluEval.

Summary

The paper introduces HaluEval, a benchmark that systematically evaluates hallucination tendencies in LLMs using 35,000 samples across diverse tasks.
Task-specific samples are generated automatically with a two-step sampling-then-filtering framework, while 5,000 ChatGPT responses to general user queries are annotated by human labelers; the annotations show that ChatGPT fabricates unverifiable information in about 19.5% of responses.
The findings indicate that existing LLMs struggle to recognize hallucinations, and that supplying external knowledge or adding reasoning steps helps them do so.

HaluEval: A Hallucination Evaluation Benchmark for LLMs

The paper introduces HaluEval, a large-scale benchmark designed to evaluate hallucination in LLMs such as ChatGPT. Although LLMs are proficient in many NLP applications, they are known to generate "hallucinations", that is, content that conflicts with the source material or cannot be verified against factual knowledge. The benchmark is meant to show what types of content LLMs tend to hallucinate, to what extent, and how well they can recognize hallucinated text.

Methodology

HaluEval consists of 35,000 samples: 5,000 general user queries and 30,000 task-specific examples drawn from three tasks, namely question answering, knowledge-grounded dialogue, and text summarization. The task-specific examples are generated automatically with a two-step framework, sampling-then-filtering. In the sampling step, ChatGPT is prompted to produce hallucinated content; in the filtering step, the candidates are screened so that only the most plausible and difficult hallucinated samples are kept.
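The sampling-then-filtering idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `sample_then_filter`, the stub samplers, and the stub selector are hypothetical names introduced here, and in the actual benchmark both steps are carried out by prompting ChatGPT. Dropping candidates that merely repeat the right answer is an extra safeguard assumed for this sketch.

```python
from typing import Callable, List

# Hypothetical types for this sketch (not from the paper):
# a sampler maps (question, right_answer) to one hallucinated candidate;
# a selector maps (question, candidates) to the index of the candidate to keep.
Sampler = Callable[[str, str], str]
Selector = Callable[[str, List[str]], int]


def sample_then_filter(question: str, right_answer: str,
                       samplers: List[Sampler], select: Selector) -> str:
    """Return one hallucinated answer chosen from several sampled candidates."""
    # Step 1 (sampling): every sampler proposes one hallucinated candidate.
    candidates = [sampler(question, right_answer) for sampler in samplers]

    # Assumed safeguard: discard candidates that just repeat the right answer.
    normalized = right_answer.strip().lower()
    candidates = [c for c in candidates if c.strip().lower() != normalized]
    if not candidates:
        raise ValueError("no hallucinated candidate was produced")

    # Step 2 (filtering): keep the candidate the selector judges to be the
    # most plausible and difficult one.
    index = select(question, candidates)
    if not 0 <= index < len(candidates):
        raise ValueError("selector returned an out-of-range index")
    return candidates[index]


# Stand-ins for the model calls, used only to make the sketch runnable.
def stub_sampler_swap(question: str, right_answer: str) -> str:
    return right_answer + ", Texas"


def stub_sampler_echo(question: str, right_answer: str) -> str:
    return right_answer  # fails to hallucinate, so it is discarded


def stub_select_longest(question: str, candidates: List[str]) -> int:
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


if __name__ == "__main__":
    print(sample_then_filter("What is the capital of France?", "Paris",
                             [stub_sampler_swap, stub_sampler_echo],
                             stub_select_longest))
```

Passing the samplers and the selector as callables keeps the two steps separate, so either one could be replaced by a real model call without touching the control flow.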

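The human-annotated portion contains 5,000 ChatGPT responses to general user queries, each labeled by human annotators as containing hallucinated content or not. Together with the generated samples, these labels serve as ground truth for measuring how well LLMs recognize hallucinations.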

Empirical Results

The human annotations show that ChatGPT produces unverifiable content in about 19.5% of its responses. LLMs also struggle to recognize hallucinations: ChatGPT reaches only 62.59% accuracy when asked to identify hallucinated content in the question answering samples. Providing external knowledge or adding reasoning steps improves recognition, which suggests possible routes to mitigating hallucinations.
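Recognition accuracy of this kind is the fraction of samples for which a model's yes/no judgment matches the ground-truth label. The sketch below shows one way to compute it, with external knowledge optionally prepended to the prompt. It is an illustration only: the prompt wording, `build_prompt`, `recognition_accuracy`, and the stub judge are assumptions made here and do not reproduce the paper's instructions or evaluation scripts.

```python
from typing import Callable, Iterable, Optional, Tuple

# One sample: (text to judge, optional external knowledge, ground-truth label).
Sample = Tuple[str, Optional[str], bool]


def build_prompt(text: str, knowledge: Optional[str] = None) -> str:
    """Assemble a yes/no recognition prompt (hypothetical wording)."""
    parts = []
    if knowledge:
        # External knowledge, when available, is placed before the text.
        parts.append(f"Knowledge: {knowledge}")
    parts.append(f"Text: {text}")
    parts.append("Does the text contain hallucinated content? Answer Yes or No.")
    return "\n".join(parts)


def recognition_accuracy(judge: Callable[[str], str],
                         samples: Iterable[Sample]) -> float:
    """Fraction of samples whose predicted label matches the ground truth."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    correct = 0
    for text, knowledge, is_hallucinated in samples:
        reply = judge(build_prompt(text, knowledge))
        # A reply starting with "yes" is read as "hallucinated".
        predicted = reply.strip().lower().startswith("yes")
        correct += int(predicted == is_hallucinated)
    return correct / len(samples)


# Stand-in for a model call: it always answers "Yes".
def stub_judge_always_yes(prompt: str) -> str:
    return "Yes"


if __name__ == "__main__":
    demo = [
        ("The Eiffel Tower is in Berlin.", "The Eiffel Tower is in Paris.", True),
        ("The Eiffel Tower is in Paris.", "The Eiffel Tower is in Paris.", False),
    ]
    print(recognition_accuracy(stub_judge_always_yes, demo))
```

A judge that always answers "Yes" is right only on the hallucinated half of a balanced set, so accuracy near 50% on such a set means the model is doing little better than a constant guess.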

Insights and Implications

The benchmark provides a framework for studying which kinds of content LLMs tend to hallucinate and how reliably they can recognize hallucinated text. The results underscore the value of giving LLMs auxiliary information, such as external knowledge, when they judge whether content is factual. This matters for deploying LLMs in sensitive applications where accuracy is paramount.

Future Directions

Further research could explore integrating dynamic knowledge retrieval systems with LLMs to address hallucinations more robustly. Additionally, expanding the benchmark to include more varied datasets and hallucination types can deepen insights and improve LLM design.

HaluEval is a step towards measuring and improving the reliability of LLMs, giving later work a shared set of samples on which to test hallucination recognition.
