Kvasir-VQA: A Text-Image Pair GI Tract Dataset

(arXiv:2409.01437)
Published Sep 2, 2024 in cs.CV and cs.AI

Abstract

We introduce Kvasir-VQA, an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations to facilitate advanced machine learning tasks in Gastrointestinal (GI) diagnostics. This dataset comprises 6,500 annotated images spanning various GI tract conditions and surgical instruments, and it supports multiple question types including yes/no, choice, location, and numerical count. The dataset is intended for applications such as image captioning, Visual Question Answering (VQA), text-based generation of synthetic medical images, object detection, and classification. Our experiments demonstrate the dataset's effectiveness in training models for three selected tasks, showcasing significant applications in medical image analysis and diagnostics. We also present evaluation metrics for each task, highlighting the usability and versatility of our dataset. The dataset and supporting artifacts are available at https://datasets.simula.no/kvasir-vqa.

[Figure: Fine-tuned VQA model example on Task 2.]

Overview

  • The Kvasir-VQA dataset enhances existing GI tract datasets by integrating image annotations with question-and-answer pairs, aiding advanced ML tasks like Visual Question Answering and image captioning.

  • The dataset includes 6,500 annotated images covering a range of GI conditions and surgical instruments, developed in collaboration with medical professionals.

  • Experiments using the dataset for image captioning, VQA, and synthetic medical image generation demonstrated its utility in improving diagnostic AI models.

Kvasir-VQA: A Text-Image Pair GI Tract Dataset

The paper "Kvasir-VQA: A Text-Image Pair GI Tract Dataset" presents a novel dataset aimed at enhancing the capabilities of ML models in gastrointestinal (GI) diagnostics through the integration of vision and language models. This work builds upon the existing HyperKvasir and Kvasir-Instrument datasets by adding detailed question-and-answer annotations, thus enabling a broader range of sophisticated applications. The dataset includes 6,500 annotated images covering various GI tract conditions and surgical instruments and supports multiple question types, such as yes/no, choice, location, and numerical count.

Key Contributions

  1. Dataset Enhancement: The Kvasir-VQA dataset is an enriched version of the HyperKvasir and Kvasir-Instrument datasets: in addition to the images, it incorporates textual annotations in the form of question-and-answer pairs. This addition facilitates advanced ML tasks such as Visual Question Answering (VQA), image captioning, and the generation of synthetic medical images.

  2. Comprehensive Annotation: The dataset includes 6,500 annotated images with detailed question-and-answer pairs, developed in collaboration with medical professionals. These annotations cover a broad spectrum of GI tract findings, from normal mucosa to abnormalities such as polyps, esophagitis, and ulcerative colitis, as well as various surgical instruments.

  3. Facilitating Multiple Tasks: The versatility of the dataset is demonstrated through experiments in image captioning, VQA, and synthetic medical image generation, underscoring its potential to improve GI diagnostics by enabling different types of AI models that can produce automatic medical reports, support interactive diagnosis, and generate high-fidelity synthetic medical images.

  4. Synthetic VQA Dataset Generation Using LLMs: The authors used the LLaMA-3 LLM to generate synthetic question-and-answer pairs from existing image captions, substantially broadening the scope of the VQA tasks; a sketch of this kind of caption-to-QA generation appears after this list.
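
The summary does not specify the authors' exact prompting protocol, so the following is only a minimal sketch of caption-to-QA generation with an instruction-tuned model via `transformers`. The model id, prompt wording, and output format are illustrative assumptions.

```python
# Minimal sketch of caption-to-QA generation with an instruction-tuned LLM.
# The model id (a gated stand-in), prompt, and parsing are illustrative
# assumptions, not the authors' exact protocol.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed stand-in model
    device_map="auto",
)

caption = "A single sessile polyp is visible in the sigmoid colon."
prompt = (
    "You are annotating gastrointestinal endoscopy images.\n"
    f"Caption: {caption}\n"
    "Write one question answerable from the caption and its short answer, "
    "formatted as 'Q: ...\\nA: ...'."
)

out = generator(prompt, max_new_tokens=64, do_sample=False)
print(out[0]["generated_text"])
```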

Numerical Results and Model Performances

The dataset's efficacy was demonstrated through three primary experiments:

  1. Image Captioning: The Florence-2 model, fine-tuned for image captioning, generated high-quality descriptive captions for medical images. Performance was evaluated using BLEU, ROUGE, METEOR, and CIDEr (a metric-computation sketch follows this list), with results indicating that the model can accurately describe medical images.

  2. Visual Question Answering (VQA): By fine-tuning the Florence-2 model, the authors were able to answer specific questions about the images effectively. The model's scores on BLEU, ROUGE, METEOR, and CIDEr indicate proficiency in providing context-aware responses, further validating the dataset's utility for training VQA models.

  3. Synthetic Medical Image Generation: The Stable Diffusion 3 model was fine-tuned to generate synthetic medical images based on textual prompts. Evaluated using FID and Inception Score (IS), the synthetic images demonstrated a good balance of fidelity and diversity, which is critical for augmenting training datasets in scenarios where real medical images are scarce.
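
For the n-gram metrics used in Tasks 1 and 2, a typical setup relies on the Hugging Face `evaluate` library; the sketch below uses placeholder predictions and references, and is not the authors' exact evaluation harness. Note that CIDEr is not bundled with `evaluate` and is commonly computed with a separate package such as pycocoevalcap.

```python
# Minimal sketch of caption/VQA evaluation with BLEU, ROUGE, and METEOR via
# the Hugging Face `evaluate` library. Predictions and references below are
# placeholders; CIDEr typically requires pycocoevalcap.
import evaluate

predictions = ["a small polyp in the cecum"]
references = [["one small polyp located in the cecum"]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print("BLEU:  ", bleu.compute(predictions=predictions, references=references)["bleu"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
```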

Implications and Future Directions

The Kvasir-VQA dataset offers significant practical and theoretical contributions to AI research in medical diagnostics:

Practical Implications:

The integrated vision-language approach facilitated by Kvasir-VQA can significantly enhance the accuracy and efficiency of GI diagnostics. Automated image captioning and VQA can assist clinicians by providing detailed, contextually accurate information, thereby reducing clinician workload and the potential for error. Additionally, the ability to generate synthetic images can address data scarcity and class imbalance, improving the robustness of diagnostic tools.
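
As a concrete illustration of the fidelity/diversity evaluation mentioned for Task 3, FID and Inception Score can be computed with `torchmetrics`. The sketch below uses random tensors as stand-ins for real and generated endoscopy images; it shows the metric plumbing only, not the authors' exact pipeline.

```python
# Minimal sketch of FID and Inception Score with torchmetrics
# (requires: pip install "torchmetrics[image]"). Random uint8 tensors stand
# in for real and generated endoscopy images; shapes/counts are illustrative.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())
```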

Theoretical Implications:

From a theoretical standpoint, the combination of vision and language models for complex medical images opens avenues for new research in multimodal AI. This work highlights the potential of integrating textual and visual information to develop models that can interpret and interact with complex visual data in a human-like manner.

Limitations and Future Work

Despite its strengths, certain limitations persist. Not all annotations were verified by medical experts due to time constraints, and the current dataset does not cover the full spectrum of GI conditions. Future work should include complete expert validation of annotations and an expanded scope covering additional GI conditions. Moreover, leveraging the entire dataset with appropriate training, validation, and test splits (sketched below) will ensure more robust and reproducible results.
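
As one way to realize such reproducible splits, the `datasets` API supports seeded splitting. The proportions, seed, and repository id below are illustrative assumptions; in practice, splits should be grouped by source image so that question-answer pairs sharing an image do not leak across splits.

```python
# Minimal sketch of a seeded train/validation/test split with `datasets`.
# Proportions and seed are illustrative; for real use, split at the image
# level so QA pairs for the same image stay in one split.
from datasets import load_dataset

ds = load_dataset("SimulaMet-HOST/Kvasir-VQA")  # repo id as assumed earlier
full = ds[next(iter(ds))]

tmp = full.train_test_split(test_size=0.2, seed=42)       # 80% train
val_test = tmp["test"].train_test_split(test_size=0.5, seed=42)  # 10%/10%

train, val, test = tmp["train"], val_test["train"], val_test["test"]
print(len(train), len(val), len(test))
```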

Conclusion

The Kvasir-VQA dataset represents a significant advancement in the field of medical image analysis, particularly for GI diagnostics. By supporting a variety of applications such as image captioning, VQA, and synthetic image generation, the dataset is poised to enhance the development of sophisticated diagnostic tools. Future enhancements will focus on expanding the dataset's scope and ensuring comprehensive expert validation, thus solidifying its role as a critical resource for advancing AI in healthcare.
