
Lynx: An Open Source Hallucination Evaluation Model

(2407.08488)
Published Jul 11, 2024 in cs.AI and cs.CL

Abstract

Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in LLMs. However, LLMs can still produce information that is unsupported or contradictory to the retrieved contexts. We introduce LYNX, a SOTA hallucination detection LLM that is capable of advanced reasoning on challenging real-world hallucination scenarios. To evaluate LYNX, we present HaluBench, a comprehensive hallucination evaluation benchmark, consisting of 15k samples sourced from various real-world domains. Our experiment results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and closed and open-source LLM-as-a-judge models on HaluBench. We release LYNX, HaluBench and our evaluation code for public access.

Figure: LLM-as-a-judge responses from GPT-4o, Claude-3-Sonnet, and Lynx (70B) for a HaluEval question-answering example.

Overview

  • The paper introduces 'Lynx,' a state-of-the-art hallucination detection model, along with a comprehensive benchmark called 'HaluBench' to evaluate hallucinations in LLMs.

  • Lynx addresses a key weakness of current Retrieval Augmented Generation (RAG) systems by detecting unfaithful outputs through advanced reasoning, backed by extensive evaluation datasets.

  • The model, fine-tuned on data that includes reasoning chains, outperforms other leading models on HaluBench, demonstrating its efficacy in diverse and specialized domains such as finance and medicine.


The authors present "Lynx," a state-of-the-art (SOTA) hallucination detection model, alongside a comprehensive hallucination evaluation benchmark called "HaluBench." The motivations behind this work are rooted in the inherent limitations of LLMs in maintaining faithfulness to their retrieved contexts and the desire to improve the reliability of Retrieval Augmented Generation (RAG) systems. By providing a rigorous benchmark and an open-source model, the authors aim to set new standards for hallucination detection in diverse real-world applications.

Introduction

RAG techniques have been introduced to enhance the knowledge flexibility and extensibility of LLMs by allowing these models to access external data stores. However, LLMs often generate outputs that are poorly aligned with, or even contradictory to, the provided contexts, resulting in what are referred to as "hallucinations." The authors note that existing models, including LLM-as-a-Judge systems, often fail to detect such hallucinations effectively, especially in specialized domains. Lynx is proposed to address these shortcomings by incorporating advanced reasoning capabilities and extensive evaluation datasets.

Contributions

The authors outline several key contributions:

  1. HaluBench: A benchmark with 15,000 samples that includes rigorous annotations of hallucinations from various real-world domains like finance and medicine.
  2. Lynx Model: An open-source hallucination detection LLM that outperforms other models such as GPT-4o and Claude-3-Sonnet.
  3. Generation of Hard-to-Detect Hallucinations: Introduction of semantic perturbations to generate challenging examples for training and evaluation.
  4. Comprehensive Benchmarking: Extensive experiments to validate the performance of Lynx against other state-of-the-art models.

Methodology

Hallucination Evaluation

The task of hallucination detection is formulated as judging whether an answer is faithful to its retrieved context in a RAG system. Intrinsic hallucinations occur when LLM outputs are inconsistent with the given contexts. The authors focus on intrinsic hallucinations, leaving extrinsic hallucinations (outputs that conflict with world knowledge rather than the provided context) outside the scope of this work.
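
To make the formulation concrete, the sketch below frames the judgment as a binary PASS/FAIL decision over a (question, context, answer) triple. The prompt wording, the `call_judge` stub, and the PASS/FAIL convention are illustrative assumptions rather than the exact Lynx prompt.

```python
# Minimal sketch of hallucination detection as a binary faithfulness
# judgment over a (question, context, answer) triple. The prompt wording
# and the PASS/FAIL convention are illustrative assumptions.

JUDGE_PROMPT = """Given the QUESTION, CONTEXT and ANSWER below, decide whether
the ANSWER is faithful to the CONTEXT. Reply with PASS if every claim in the
ANSWER is supported by the CONTEXT, and FAIL otherwise.

QUESTION: {question}
CONTEXT: {context}
ANSWER: {answer}

VERDICT:"""


def call_judge(prompt: str) -> str:
    """Placeholder for a call to a judge LLM (e.g. a hosted Lynx endpoint)."""
    raise NotImplementedError("wire this to your model-serving API")


def is_faithful(question: str, context: str, answer: str) -> bool:
    """Return True if the judge deems the answer faithful to the context."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    verdict = call_judge(prompt)
    return verdict.strip().upper().startswith("PASS")
```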

HaluBench Construction

To create HaluBench, the authors source examples from various QA datasets like CovidQA, PubMedQA, DROP, and FinanceBench. They generate additional hallucinated samples through semantic perturbations of gold-standard answers. Human annotation verifies the quality and correctness of these perturbations, ensuring that the benchmark is both comprehensive and reliable.
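
A rough sketch of the perturbation step is given below: a gold answer is minimally rewritten so that it is no longer supported by the passage, and the resulting triple is kept as a hallucinated sample. The `perturb_with_llm` stub and the prompt wording are assumptions; in the paper's pipeline, perturbed samples are additionally verified by human annotators.

```python
# Hypothetical sketch of creating a hard-to-detect hallucinated sample by
# semantically perturbing a gold answer. The prompt and the LLM stub are
# illustrative; HaluBench samples were additionally verified by humans.

PERTURBATION_PROMPT = """Rewrite the ANSWER so that it stays fluent and
plausible but makes at least one claim that is NOT supported by the CONTEXT.
Change as little as possible.

CONTEXT: {context}
QUESTION: {question}
ANSWER: {answer}

PERTURBED ANSWER:"""


def perturb_with_llm(prompt: str) -> str:
    """Placeholder for a call to a strong generator LLM."""
    raise NotImplementedError("wire this to your model-serving API")


def make_hallucinated_sample(question: str, context: str, gold_answer: str) -> dict:
    """Build a labeled, unfaithful-by-construction sample from a gold answer."""
    prompt = PERTURBATION_PROMPT.format(
        context=context, question=question, answer=gold_answer
    )
    perturbed = perturb_with_llm(prompt)
    return {
        "question": question,
        "context": context,
        "answer": perturbed,
        "label": "FAIL",  # unfaithful by construction, pending human review
    }
```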

Model Training

Lynx is fine-tuned from Llama-3-70B-Instruct and Llama-3-8B-Instruct on a dataset that incorporates reasoning chains. This approach draws on methods such as Chain of Thought (CoT) prompting to produce detailed reasoning steps, improving the interpretability of the model's output. The training methodology targets robustness across different domains and scenarios, making Lynx versatile and effective for hallucination detection tasks.
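
As an illustration of what such training data might look like, the sketch below builds a chat-style record that pairs the judge input with a target containing both a reasoning chain and a final verdict. The field names (`REASONING`, `SCORE`) and the JSONL layout are assumptions made for illustration, not the released Lynx training format.

```python
# Illustrative sketch of a chat-style fine-tuning record that pairs the judge
# input with a CoT-style target (reasoning followed by a verdict). Field names
# and layout are assumptions, not the released Lynx training format.
import json


def make_training_record(question: str, context: str, answer: str,
                         reasoning: str, verdict: str) -> str:
    """Serialize one (input, reasoning + verdict) pair as a JSONL line."""
    user_turn = (
        "Decide whether the ANSWER is faithful to the CONTEXT.\n"
        f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    )
    assistant_turn = json.dumps({"REASONING": reasoning, "SCORE": verdict})
    record = {
        "messages": [
            {"role": "user", "content": user_turn},
            {"role": "assistant", "content": assistant_turn},
        ]
    }
    return json.dumps(record)


# Example: a faithful ("PASS") record written as a single JSONL line.
print(make_training_record(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France.",
    answer="Paris.",
    reasoning="The context states Paris is the capital of France; the answer matches.",
    verdict="PASS",
))
```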

Results

Lynx (70B) achieves the highest average accuracy on HaluBench at 87.4%, compared with 86.5% for GPT-4o and 78.8% for Claude-3-Sonnet. It also outperforms other open-source models by wide margins, demonstrating its efficacy across diverse and complex domains. The gains are particularly pronounced in specialized fields such as the medical domain, where Lynx surpasses the other models by a substantial margin.
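
For reference, these figures are example-level accuracies over HaluBench's labeled (question, context, answer) triples. A minimal scoring loop, assuming a hypothetical `predict_verdict` function that returns PASS or FAIL for whatever judge model is being evaluated, might look like this:

```python
# Minimal sketch of scoring a judge model on HaluBench-style examples.
# `predict_verdict` is a hypothetical stand-in for any judge (Lynx, GPT-4o, ...).
from typing import Callable


def accuracy(examples: list[dict],
             predict_verdict: Callable[[str, str, str], str]) -> float:
    """Fraction of examples whose predicted PASS/FAIL label matches the gold label."""
    correct = 0
    for ex in examples:
        pred = predict_verdict(ex["question"], ex["context"], ex["answer"])
        correct += int(pred.strip().upper() == ex["label"].strip().upper())
    return correct / len(examples)


# Usage with a trivial dummy judge that always predicts PASS:
examples = [
    {"question": "q", "context": "c", "answer": "a", "label": "PASS"},
    {"question": "q", "context": "c", "answer": "b", "label": "FAIL"},
]
print(accuracy(examples, lambda q, c, a: "PASS"))  # 0.5
```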

Implications and Future Work

Practical Implications

The open-source nature of Lynx has significant practical implications. It enables the deployment of reliable hallucination detection in critical domains such as finance and healthcare. By narrowing the gap between closed- and open-source LLMs, Lynx democratizes access to high-quality hallucination detection tools.

Theoretical Implications

The introduction of semantic perturbations for generating challenging hallucination examples sets a new standard in the construction of benchmark datasets. This methodology can be applied to other NLP tasks, pushing the boundaries of what LLMs can be trained and evaluated on.

Future Developments

Several areas are highlighted for future work:

  • Failures Outside LLM Generation: Exploration of retrieval component failures and their impact on hallucination detection.
  • Multilingual Coverage: Extending Lynx and HaluBench to cover non-English and low-resource languages.
  • NLP Task Extension: Applying Lynx to other tasks like abstractive summarization.
  • Truthfulness and World Knowledge: Incorporating external knowledge sources for evaluating the factuality of LLM outputs.
  • Natural Language Inference (NLI): Investigating the application of Lynx in NLI tasks given the inherent similarities.

Conclusion

The paper presents significant advances in the detection of hallucinations in LLM-generated outputs through the development of Lynx and the HaluBench benchmark. By outperforming both open- and closed-source alternatives, Lynx sets a new standard for faithfulness evaluation in RAG systems. The authors contribute valuable resources to the research community, facilitating further exploration and improvement in the reliability and robustness of AI-generated text.
