HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

(2406.19280)
Published Jun 27, 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

The rapid development of multimodal LLMs (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.

Overview

  • The paper introduces PubMedVision, a refined dataset with 1.3 million medical Visual Question Answering (VQA) samples aimed at enhancing Multimodal LLMs (MLLMs) in medical contexts.

  • A structured pipeline for curating high-quality medical image-text pairs from PubMed datasets was proposed, incorporating text filtering, image filtering, and deduplication to address noise and data quality issues.

  • Experiments demonstrated significant performance improvements in medical VQA benchmarks, traditional medical imaging tasks, and multimodal benchmarks thanks to the new dataset and methodologies introduced.

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

The paper "HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale" addresses two key challenges in the development of medical-centric Multimodal LLMs (MLLMs): the scarcity of high-quality medical vision-text data and the noise inherent in existing large-scale datasets. The authors propose PubMedVision, a refined dataset comprising 1.3 million medical Vision Question Answering (VQA) samples, which significantly enhances the capabilities of MLLMs in medical contexts.

Methodology

Data Collection and Refinement

The paper begins by highlighting the limitations of existing medical multimodal datasets, such as those derived from PubMed, which contain de-identified medical images and textual descriptions but suffer from high noise levels. To overcome these deficiencies, the authors implemented a structured pipeline to curate high-quality medical image-text pairs from various PubMed datasets. This pipeline includes:

  1. Text Filtering: Utilizing a medical vocabulary to filter out text with low medical relevance.
  2. Image Filtering: Removing low-resolution images and those irrelevant to medical contexts using a classification model.
  3. Deduplication: Applying semantic retrieval techniques to ensure the uniqueness and quality of the selected data.
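
The Python sketch below illustrates the general shape of such a curation pipeline. The toy vocabulary, resolution threshold, similarity cutoff, and helper names are illustrative assumptions rather than the authors' implementation; in the paper, image filtering also relies on a trained classification model, which is only noted here as a comment.

```python
# Illustrative sketch of a PubMed image-text curation pipeline.
# Vocabulary, thresholds, and the embedding field are assumptions,
# not the paper's exact configuration.
from dataclasses import dataclass

MEDICAL_VOCAB = {"lesion", "tumor", "radiograph", "biopsy", "mri", "ct"}  # toy vocabulary
MIN_SIDE_PX = 336          # assumed minimum image resolution
DEDUP_THRESHOLD = 0.9      # assumed cosine-similarity cutoff for near-duplicates

@dataclass
class Sample:
    caption: str
    width: int
    height: int
    embedding: list[float]  # semantic embedding of the caption (assumed precomputed)

def text_relevant(caption: str) -> bool:
    """Keep captions that mention at least one medical term."""
    tokens = set(caption.lower().split())
    return bool(tokens & MEDICAL_VOCAB)

def image_ok(sample: Sample) -> bool:
    """Drop low-resolution images; the paper additionally uses a trained
    classifier to discard images irrelevant to medical contexts."""
    return min(sample.width, sample.height) >= MIN_SIDE_PX

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(samples: list[Sample]) -> list[Sample]:
    """Greedy semantic deduplication: keep a sample only if it is not
    too similar to anything already kept."""
    kept: list[Sample] = []
    for s in samples:
        if all(cosine(s.embedding, k.embedding) < DEDUP_THRESHOLD for k in kept):
            kept.append(s)
    return kept

def curate(samples: list[Sample]) -> list[Sample]:
    filtered = [s for s in samples if text_relevant(s.caption) and image_ok(s)]
    return deduplicate(filtered)
```

The greedy deduplication step mirrors the semantic-retrieval idea: a candidate is kept only if no already-kept caption is embedded too close to it.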

Denoising and Data Reformatting

To denoise and reformat the data, the authors utilized GPT-4V in an "unblinded" capacity. Unlike previous approaches that utilized text-only LLMs (denoted as "blinded"), this method leverages the combination of visual and textual information, ensuring more accurate and relevant data synthesis. The resulting dataset, named PubMedVision, includes both detailed descriptions (Alignment VQA) and context-specific question-answer pairs (Instruction-Tuning VQA).
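
As a rough sketch of what "unblinded" reformatting looks like in practice, the snippet below sends both the image and its original caption to a vision-capable chat model and asks for a detailed description plus grounded question-answer pairs. The model name, prompt wording, and output handling are assumptions made for illustration; the paper's actual prompts and generation settings may differ.

```python
# Illustrative "unblinded" reformatting call: the MLLM sees both the image
# and its PubMed caption, unlike text-only ("blinded") reformatting.
import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

PROMPT = (
    "You are given a medical image and its original PubMed caption.\n"
    "1) Write a detailed description of the image (Alignment VQA).\n"
    "2) Write several question-answer pairs grounded in the image and "
    "caption (Instruction-Tuning VQA).\n"
    "Correct obvious caption noise; do not invent findings."
)

def reformat_pair(image_path: str, caption: str) -> str:
    """Return the model's reformatted VQA text for one image-caption pair."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model; a placeholder choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{PROMPT}\n\nCaption: {caption}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```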

Experimental Setup and Results

The experiments were designed to validate the efficacy of PubMedVision in enhancing MLLMs' performance on medical tasks. The authors compared the performance of LLaVA-v1.5-LLaMA3-8B trained with PubMedVision against several baseline models, including Med-Flamingo, RadFM, LLaVA-Med-7B, and general MLLMs like LLaVA-v1.6-34B. The evaluation was conducted using three benchmark types:

  1. Medical VQA Benchmarks: Including VQA-RAD, SLAKE, PathVQA, and PMC-VQA.
  2. Multimodal Benchmark: MMMU's Health & Medicine track.
  3. Traditional Medical Imaging Tasks: Using datasets from OmniMedVQA.
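
Closed-ended medical VQA benchmarks of this kind are typically scored by exact-match accuracy over the model's answers. The helper below is a generic sketch of that scoring loop, not the benchmarks' official evaluation code; `model_answer` is a hypothetical stand-in for whichever MLLM is being evaluated, and real benchmarks apply their own answer normalization.

```python
# Generic accuracy computation for a closed-ended VQA benchmark (sketch).
from typing import Callable

def vqa_accuracy(
    examples: list[dict],                 # each: {"image": ..., "question": ..., "answer": ...}
    model_answer: Callable[[dict], str],  # returns the model's answer string for one example
) -> float:
    """Fraction of examples whose predicted answer matches the reference."""
    if not examples:
        return 0.0
    correct = sum(
        model_answer(ex).strip().lower() == ex["answer"].strip().lower()
        for ex in examples
    )
    return correct / len(examples)
```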

Key Findings

  1. Medical VQA Benchmarks: Models trained with PubMedVision showed notable improvements, with an 11.7% increase in overall accuracy compared to models trained on previous datasets.
  2. Traditional Medical Imaging: The incorporation of PubMedVision improved model performance by 26.3%, surpassing other methods.
  3. Multimodal Benchmarks (MMMU): LLaVA-v1.5-LLaMA3-8B combined with PubMedVision achieved performance levels comparable to the larger LLaVA-v1.6-34B model.

To further validate the robustness of PubMedVision, the authors trained HuatuoGPT-Vision, a 34B-parameter MLLM, which outperformed existing medical and general MLLMs across several benchmarks.

Data Quality Examination

The quality of data generated with different reformatting techniques was assessed through both expert evaluation and empirical testing. The results indicated that the "MLLM-Reformatted" method used in PubMedVision outperformed alternative methods such as "Native Captions" and "LLM-Reformatted". Experts rated MLLM-Reformatted data higher in terms of accuracy, relevance, completeness, and usefulness.

Implications and Future Directions

The PubMedVision dataset and the methodologies employed in its creation have significant implications for the development of specialized MLLMs with enhanced medical visual understanding. The refined data approach and the effective reformatting pipeline present a scalable solution to overcoming the challenges posed by noisy, large-scale datasets. Future research should focus on optimizing the quality assurance processes for such datasets and expanding the diversity of scenarios used in VQA generation to cover an even broader range of medical applications.

Additionally, exploring the integration of this approach with new advancements in NLP and computational medicine could further push the boundaries of what is possible with multimodal LLMs in healthcare. By continually improving the quality and scale of medical datasets, we pave the way for more reliable and sophisticated AI systems capable of assisting in complex medical decision-making processes.

In conclusion, the development of PubMedVision represents a significant step forward in the field of medical MLLMs, providing a robust foundation for future advancements in medical AI.
