Emergent Mind

OLAPH: Improving Factuality in Biomedical Long-form Question Answering

(2405.12701)
Published May 21, 2024 in cs.CL and cs.AI

Abstract

In the medical domain, numerous scenarios necessitate the long-form generation ability of LLMs. Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate the automatic evaluations of factuality. We also propose OLAPH, a simple and novel framework that enables the improvement of factuality through automatic evaluations. The OLAPH framework iteratively trains LLMs to mitigate hallucinations using sampling predictions and preference optimization. In other words, we iteratively set the highest-scoring response as a preferred response derived from sampling predictions and train LLMs to align with the preferred response that improves factuality. We highlight that, even on evaluation metrics not used during training, LLMs trained with our OLAPH framework demonstrate significant performance improvement in factuality. Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality. We believe that our work could shed light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are available at https://github.com/dmis-lab/OLAPH.

Figure: MedLFQA provides GPT-4 answers on Lexapro's definition, advantages, disadvantages, and side effects for comprehensive evaluation.

Overview

  • The paper introduces the MedLFQA dataset, a novel benchmark designed to enhance the factuality of long-form question answering in the biomedical domain by providing detailed question-answer pairs with Must Have (MH) and Nice to Have (NH) statements.

  • The authors present the OLAPH framework, which optimizes the factuality of responses from LLMs through a multi-step iterative process that includes supervised fine-tuning and direct preference optimization.

  • Evaluation and iterative training using the OLAPH framework show significant improvements in model factuality, with models like BioMistral achieving performance levels comparable to expert-annotated responses and high-performing models like GPT-4.

MedLFQA: Enhancing the Factuality of Long-Form Medical Responses Using OLAPH

The paper presents a new benchmark dataset, MedLFQA, tailored for long-form question answering (LFQA) in the biomedical domain. The dataset is specifically designed to evaluate the factuality of responses generated by LLMs. To improve the factuality of these responses, the authors introduce a novel framework named OLAPH, which iteratively refines LLMs' outputs through a systematic preference optimization process.

Key Contributions

  1. MedLFQA Benchmark Dataset: The authors reconstructed existing biomedical LFQA datasets to create MedLFQA, which encompasses detailed question-answer pairs along with two types of crucial statements: Must Have (MH) and Nice to Have (NH). This reconstruction facilitates the automatic evaluation of model responses to ensure high factual accuracy.
  2. OLAPH Framework: OLAPH stands for "Optimizing Large language models' Answers with Preferences of mitigating Hallucination". It enhances factuality through a multi-step iterative training process, incorporating supervised fine-tuning (SFT) and direct preference optimization (DPO). The strongest responses are iteratively selected using comprehensive evaluation metrics covering word composition, semantic similarity, and factuality.

Methodology

MedLFQA Dataset Reconstruction

The MedLFQA dataset is built by integrating and reformatting several existing LFQA datasets, including LiveQA, MedicationQA, HealthSearchQA, and K-QA. The reconstruction involves not just providing answers but also generating MH and NH statements to precisely evaluate the factuality and relevance of the responses.
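
A minimal sketch of how a MedLFQA entry might be represented, pairing each question with a long-form answer and its MH/NH statements. The field names here are illustrative assumptions, not the dataset's official schema:

```python
# A hypothetical MedLFQA entry: each question carries a long-form answer
# plus Must Have (MH) and Nice to Have (NH) statements, which later serve
# as the ground truth for automatic factuality evaluation.
entry = {
    "question": "What are the side effects of Lexapro?",
    "long_answer": "Lexapro (escitalopram) is an SSRI antidepressant. "
                   "Common side effects include nausea, insomnia, and fatigue.",
    "must_have": [
        "Lexapro is an SSRI (selective serotonin reuptake inhibitor).",
        "Common side effects include nausea and insomnia.",
    ],
    "nice_to_have": [
        "Side effects often subside after the first few weeks.",
    ],
}

def claim_counts(entry):
    """Return the number of MH and NH statements for an entry."""
    return len(entry["must_have"]), len(entry["nice_to_have"])

print(claim_counts(entry))  # (2, 1)
```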

Evaluation Metrics:

  • Words Composition: Measures word-level overlap with reference answers (e.g., via ROUGE scores).
  • Semantic Similarity: Uses BLEURT and BERTScore to capture non-trivial semantic similarities.
  • Factuality: Employs the Hallucination and Comprehensiveness metrics derived from the MH/NH statements: Comprehensiveness rewards a response for covering the crucial Must Have claims, while Hallucination penalizes statements that the response contradicts.
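
These metric groups can be combined into a single score for ranking sampled responses. The sketch below is a simplified illustration; the weighting, normalization, and exact factuality formula are placeholder assumptions rather than the paper's published definitions:

```python
def factuality_score(n_mh_covered, n_mh_total, n_contradicted, n_claims):
    """Illustrative factuality score in the MH/NH spirit:
    comprehensiveness rewards covering Must Have claims,
    hallucination penalizes contradicted claims."""
    comprehensiveness = n_mh_covered / n_mh_total if n_mh_total else 0.0
    hallucination = n_contradicted / n_claims if n_claims else 0.0
    return comprehensiveness - hallucination

def composite_score(word_comp, semantic_sim, factuality,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum over the three metric groups; the actual weights
    used in the paper may differ from these placeholders."""
    w1, w2, w3 = weights
    return w1 * word_comp + w2 * semantic_sim + w3 * factuality

# A response covering 3 of 4 MH statements with no contradictions:
score = composite_score(0.4, 0.7, factuality_score(3, 4, 0, 5))
print(round(score, 2))  # 1.85
```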

OLAPH Framework Details

  • Supervised Fine-tuning (SFT): Initially, the model is fine-tuned on a smaller labeled dataset to learn the long-form question-answering task.
  • Preference Optimization: Multiple candidate predictions are generated via temperature sampling and filtered based on their evaluation scores.
  • Direct Preference Optimization (DPO): Trains the model to prefer higher-scored predictions iteratively, discouraging low-quality responses.
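
The sampling and pair-selection steps above can be sketched as follows. The function names, the stand-in model, and the length-based scoring are hypothetical simplifications for illustration; in practice the composite evaluation score would rank the candidates and the pairs would feed a DPO trainer:

```python
import random

def sample_responses(model, question, k=6, temperature=1.0):
    # Placeholder for temperature-based sampling from an LLM:
    # draw k candidate answers for the same question.
    return [model(question, temperature) for _ in range(k)]

def build_preference_pair(responses, score_fn):
    """Rank sampled responses by an evaluation score and return
    (preferred, rejected) -- the pair used for DPO training."""
    ranked = sorted(responses, key=score_fn, reverse=True)
    return ranked[0], ranked[-1]

# Toy illustration with a stand-in "model" and a length-based score.
random.seed(0)
toy_model = lambda q, t: q + " answer" + " detail" * random.randint(1, 5)
responses = sample_responses(toy_model, "Q1", k=4)
preferred, rejected = build_preference_pair(responses, score_fn=len)
assert len(preferred) >= len(rejected)
```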

The iterative process ensures that the model improves its response quality and factuality step-by-step, reducing hallucinations and aligning its answers with medically accurate information.

Results

Zero-shot Evaluation

The authors evaluated various open foundation and biomedical LLMs, including LLaMA2, Mistral, Meditron, Self-BioRAG, and BioMistral. The results unveiled inconsistencies in model performance, particularly in terms of factuality. General-purpose base models like LLaMA2 and Mistral generally showed lower factuality than specialized biomedical models like Meditron and BioMistral.

Iterative Learning

Analyses of iterative learning processes reveal substantial improvements in the model's factuality, even matching the high standards set by GPT-4. Figure 2 in the paper highlights that through iterative DPO training, models like BioMistral 7B exhibit enhanced performance, achieving scores comparable to expert-annotated and GPT-4 responses.

Implications and Future Directions

The findings underscore the necessity for robust LFQA benchmarks in the biomedical domain, given the critical importance of factuality in medical responses. The MedLFQA dataset and OLAPH framework collectively represent a significant step towards developing more reliable and accurate biomedical LLMs.

Future research should focus on:

  • Enhanced Dataset Accuracy: Continued refinement of the MedLFQA dataset to eliminate potential inaccuracies and update outdated information.
  • Scalability: Testing and validating the OLAPH framework on models with varying parameter sizes to understand its broader applicability.
  • Patient-specific Conversational Agents: Extending the framework to multi-turn conversations for comprehensive patient history understanding.

Conclusion

This paper introduces and validates a novel methodology to enhance the factual accuracy of long-form medical responses generated by LLMs. The MedLFQA dataset provides a rigorous benchmark for factuality, while the OLAPH framework ensures iterative improvement in response quality. These contributions offer a promising direction for the development of reliable clinical conversation agents, potentially aiding medical professionals by providing accurate, detailed, and comprehensible medical information.
