Emergent Mind

OLAPH: Improving Factuality in Biomedical Long-form Question Answering

(2405.12701)
Published May 21, 2024 in cs.CL and cs.AI

Abstract

In the medical domain, numerous scenarios necessitate the long-form generation ability of LLMs. Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate the automatic evaluations of factuality. We also propose OLAPH, a simple and novel framework that enables the improvement of factuality through automatic evaluations. The OLAPH framework iteratively trains LLMs to mitigate hallucinations using sampling predictions and preference optimization. In other words, we iteratively set the highest-scoring response as a preferred response derived from sampling predictions and train LLMs to align with the preferred response that improves factuality. We highlight that, even on evaluation metrics not used during training, LLMs trained with our OLAPH framework demonstrate significant performance improvement in factuality. Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality. We believe that our work could shed light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are available at https://github.com/dmis-lab/OLAPH.

Figure: MedLFQA provides GPT-4 answers on Lexapro's definition, advantages, disadvantages, and side effects for comprehensive evaluation.

Overview

  • The paper introduces the MedLFQA dataset, a novel benchmark designed to enhance the factuality of long-form question answering in the biomedical domain by providing detailed question-answer pairs with Must Have (MH) and Nice to Have (NH) statements.

  • The authors present the OLAPH framework, which optimizes the factuality of responses from LLMs through a multi-step iterative process that includes supervised fine-tuning and direct preference optimization.

  • Evaluation and iterative training using the OLAPH framework show significant improvements in model factuality, with models like BioMistral achieving performance levels comparable to expert-annotated responses and high-performing models like GPT-4.

MedLFQA: Enhancing the Factuality of Long-Form Medical Responses Using OLAPH

The paper presents a new benchmark dataset, MedLFQA, tailored for long-form question answering (LFQA) in the biomedical domain. The dataset is specifically designed to evaluate the factuality of responses generated by LLMs. To improve the factuality of these responses, the authors introduce a novel framework named OLAPH, which iteratively refines LLMs' outputs through a systematic preference optimization process.

Key Contributions

  1. MedLFQA Benchmark Dataset: The authors reconstructed existing biomedical LFQA datasets to create MedLFQA, which encompasses detailed question-answer pairs along with two types of crucial statements: Must Have (MH) and Nice to Have (NH). This reconstruction facilitates the automatic evaluation of model responses to ensure high factual accuracy.
  2. OLAPH Framework: OLAPH stands for "Optimizing Large language models' Answers with Preferences of mitigating Hallucination". It enhances factuality through a multi-step iterative training process, incorporating supervised fine-tuning (SFT) and direct preference optimization (DPO). The strongest responses are iteratively selected using comprehensive evaluation metrics covering word composition, semantic similarity, and factuality.

Methodology

MedLFQA Dataset Reconstruction

The MedLFQA dataset is built by integrating and reformatting several existing LFQA datasets, including LiveQA, MedicationQA, HealthSearchQA, and K-QA. The reconstruction involves not just providing answers but also generating MH and NH statements to precisely evaluate the factuality and relevance of the responses.
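
A minimal sketch of how a MedLFQA entry might be represented, pairing each question with a long-form answer and its MH/NH statements. The field names here are illustrative assumptions, not the dataset's official schema:

```python
# A hypothetical MedLFQA entry: each question carries a long-form answer
# plus Must Have (MH) and Nice to Have (NH) statements, which later serve
# as the ground truth for automatic factuality evaluation.
entry = {
    "question": "What are the side effects of Lexapro?",
    "long_answer": "Lexapro (escitalopram) is an SSRI antidepressant. "
                   "Common side effects include nausea, insomnia, and fatigue.",
    "must_have": [
        "Lexapro is an SSRI (selective serotonin reuptake inhibitor).",
        "Common side effects include nausea and insomnia.",
    ],
    "nice_to_have": [
        "Side effects often subside after the first few weeks.",
    ],
}

def claim_counts(entry):
    """Return the number of MH and NH statements for an entry."""
    return len(entry["must_have"]), len(entry["nice_to_have"])

print(claim_counts(entry))  # (2, 1)
```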

Evaluation Metrics:

  • Words Composition: Measures word-level overlap with reference answers (e.g., via ROUGE scores).
  • Semantic Similarity: Uses BLEURT and BERTScore to capture non-trivial semantic similarities.
  • Factuality: Employs the Hallucination and Comprehensiveness metrics derived from the MH/NH statements: Comprehensiveness rewards a response for covering the crucial Must Have claims, while Hallucination penalizes statements that the response contradicts.
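
These metric groups can be combined into a single score for ranking sampled responses. The sketch below is a simplified illustration; the weighting, normalization, and exact factuality formula are placeholder assumptions rather than the paper's published definitions:

```python
def factuality_score(n_mh_covered, n_mh_total, n_contradicted, n_claims):
    """Illustrative factuality score in the MH/NH spirit:
    comprehensiveness rewards covering Must Have claims,
    hallucination penalizes contradicted claims."""
    comprehensiveness = n_mh_covered / n_mh_total if n_mh_total else 0.0
    hallucination = n_contradicted / n_claims if n_claims else 0.0
    return comprehensiveness - hallucination

def composite_score(word_comp, semantic_sim, factuality,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum over the three metric groups; the actual weights
    used in the paper may differ from these placeholders."""
    w1, w2, w3 = weights
    return w1 * word_comp + w2 * semantic_sim + w3 * factuality

# A response covering 3 of 4 MH statements with no contradictions:
score = composite_score(0.4, 0.7, factuality_score(3, 4, 0, 5))
print(round(score, 2))  # 1.85
```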

OLAPH Framework Details

  • Supervised Fine-tuning (SFT): Initially, the model is fine-tuned on a smaller labeled dataset to learn the long-form question-answering task.
  • Preference Optimization: Multiple candidate predictions are generated via temperature sampling and filtered based on their evaluation scores.
  • Direct Preference Optimization (DPO): Trains the model to prefer higher-scored predictions iteratively, discouraging low-quality responses.
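
The sampling and pair-selection steps above can be sketched as follows. The function names, the stand-in model, and the length-based scoring are hypothetical simplifications for illustration; in practice the composite evaluation score would rank the candidates and the pairs would feed a DPO trainer:

```python
import random

def sample_responses(model, question, k=6, temperature=1.0):
    # Placeholder for temperature-based sampling from an LLM:
    # draw k candidate answers for the same question.
    return [model(question, temperature) for _ in range(k)]

def build_preference_pair(responses, score_fn):
    """Rank sampled responses by an evaluation score and return
    (preferred, rejected) -- the pair used for DPO training."""
    ranked = sorted(responses, key=score_fn, reverse=True)
    return ranked[0], ranked[-1]

# Toy illustration with a stand-in "model" and a length-based score.
random.seed(0)
toy_model = lambda q, t: q + " answer" + " detail" * random.randint(1, 5)
responses = sample_responses(toy_model, "Q1", k=4)
preferred, rejected = build_preference_pair(responses, score_fn=len)
assert len(preferred) >= len(rejected)
```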

The iterative process ensures that the model improves its response quality and factuality step-by-step, reducing hallucinations and aligning its answers with medically accurate information.

Results

Zero-shot Evaluation

The authors evaluated various open foundation and biomedical LLMs, including LLaMA2, Mistral, Meditron, Self-BioRAG, and BioMistral. The results unveiled inconsistencies in model performance, particularly in terms of factuality. General-purpose base models like LLaMA2 and Mistral generally showed lower factuality than specialized biomedical models like Meditron and BioMistral.

Iterative Learning

Analyses of iterative learning processes reveal substantial improvements in the model's factuality, even matching the high standards set by GPT-4. Figure 2 in the paper highlights that through iterative DPO training, models like BioMistral 7B exhibit enhanced performance, achieving scores comparable to expert-annotated and GPT-4 responses.

Implications and Future Directions

The findings underscore the necessity for robust LFQA benchmarks in the biomedical domain, given the critical importance of factuality in medical responses. The MedLFQA dataset and OLAPH framework collectively represent a significant step towards developing more reliable and accurate biomedical LLMs.

Future research should focus on:

  • Enhanced Dataset Accuracy: Continued refinement of the MedLFQA dataset to eliminate potential inaccuracies and update outdated information.
  • Scalability: Testing and validating the OLAPH framework on models with varying parameter sizes to understand its broader applicability.
  • Patient-specific Conversational Agents: Extending the framework to multi-turn conversations for comprehensive patient history understanding.

Conclusion

This paper introduces and validates a novel methodology to enhance the factual accuracy of long-form medical responses generated by LLMs. The MedLFQA dataset provides a rigorous benchmark for factuality, while the OLAPH framework ensures iterative improvement in response quality. These contributions offer a promising direction for the development of reliable clinical conversation agents, potentially aiding medical professionals by providing accurate, detailed, and comprehensible medical information.
