
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent (2407.02483v2)

Published 2 Jul 2024 in cs.CL and cs.AI

Abstract: Multi-Modal LLMs (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools solving seven tasks across five modalities, enabling the agent to choose the most suitable tools for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model, GPT-4o. Furthermore, MMedAgent exhibits efficiency in updating and integrating new medical tools. Codes and models are all available.

References (51)
  1. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Transactions on Medical Imaging, 33(2):577–590.
  2. Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571.
  3. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  4. Ai hospital: Interactive evaluation and collaboration of llms as intern doctors for clinical diagnosis. arXiv preprint arXiv:2402.09742.
  5. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856.
  6. Ct2rep: Automated radiology report generation for 3d medical imaging. arXiv preprint arXiv:2403.06801.
  7. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  8. Automatic tuberculosis screening using chest radiographs. IEEE Transactions on Medical Imaging, 33(2):233–245.
  9. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. Preprint, arXiv:1901.07042.
  10. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026.
  11. Medlsam: Localize and segment anything model for 3d ct images. Preprint, arXiv:2306.14752.
  12. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Preprint, arXiv:2306.00890.
  13. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957.
  14. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV).
  15. Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437.
  16. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. Preprint, arXiv:2303.05499.
  17. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Preprint, arXiv:1711.05101.
  18. Word: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from ct image. Medical Image Analysis, 82:102642.
  19. Segment anything in medical images. Nature Communications, 15(1).
  20. The multi-modality cell segmentation challenge: Towards universal solutions. Nature Methods.
  21. Fast and low-gpu-memory abdomen ct organ segmentation: The flare challenge. Medical Image Analysis, 82:102616.
  22. Segment anything model for medical image analysis: an experimental study. Medical Image Analysis, 89:102918.
  23. The multimodal brain tumor image segmentation benchmark (brats). IEEE Transactions on Medical Imaging, 34(10):1993–2024.
  24. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR.
  25. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations. Scientific Data, 9(1):429.
  26. OpenAI. 2024. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2024-05-26.
  27. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.
  28. Robert S. Porter and Justin L. Kaplan. 2011. The merck manual of diagnosis and therapy, 2011. Merck Research Laboratories.
  29. Mp5: A multi-modal open-ended embodied system in minecraft via active perception. arXiv preprint arXiv:2312.07472.
  30. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960.
  31. Large language models encode clinical knowledge. Nature, 620(7972):172–180.
  32. Pathasst: Redefining pathology through generative foundation ai assistant for pathology. arXiv preprint arXiv:2305.15072.
  33. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537.
  34. Webwise: Web interface control and sequential exploration with large language models. arXiv preprint arXiv:2310.16042.
  35. Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971.
  36. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138.
  37. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407.
  38. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158.
  39. Chatcad: Interactive computer-aided diagnosis on medical image using large language models. Preprint, arXiv:2302.07257.
  40. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. arXiv preprint arXiv:2311.05997.
  41. Michael Wooldridge and Nicholas R Jennings. 1995. Intelligent agents: Theory and practice. The knowledge engineering review, 10(2):115–152.
  42. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data. Preprint, arXiv:2308.02463.
  43. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116.
  44. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162.
  45. Zhuosheng Zhan and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
  46. Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks. arXiv preprint arXiv:2305.17100.
  47. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. Preprint, arXiv:2303.00915.
  48. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415.
  49. Loop copilot: Conducting ai ensembles for music generation and iterative editing. arXiv preprint arXiv:2310.12404.
  50. Biomedparse: a biomedical foundation model for image parsing of everything everywhere all at once. arXiv preprint arXiv:2405.12971.
  51. Chatcad+: Towards a universal and reliable interactive cad using llms. IEEE Transactions on Medical Imaging, page 1–1.

Summary

  • The paper introduces MMedAgent, which integrates multi-modal LLMs with specialized medical tools to improve various medical imaging tasks.
  • It employs an instruction-based framework and visual instruction tuning, achieving superior results in VQA, image classification, segmentation, and report generation.
  • The agent demonstrates high scalability and adaptability, effectively incorporating new tools while maintaining performance on established tasks.

"MMedAgent: Learning to Use Medical Tools with Multi-modal Agent"

Introduction

The paper introduces MMedAgent, a novel system combining multi-modal LLMs (MLLMs) with domain-specific medical tools for enhanced performance on medical imaging tasks. MLLMs, despite their wide applicability, often lack the tailored performance of specialized models on specific medical imaging modalities. MMedAgent bridges this gap by providing a versatile architecture capable of selecting and utilizing specialized tools efficiently across various modalities, thereby improving results in tasks like Visual Question Answering (VQA), image classification, segmentation, and medical report generation.

Workflow of MMedAgent

MMedAgent's workflow begins with a user-supplied query and image, followed by multi-modal LLM processing that decides which tools to activate. The model then executes the selected tools, aggregates their outputs, and formulates a comprehensive response. This automated process extends the capability of LLMs to provide detailed, expert-level responses without close manual oversight, via an instruction-based framework that teaches the agent to choose appropriate tools (Figure 1).

Figure 1: The four-step MMedAgent pipeline.
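
To make the four-step pipeline concrete, here is a minimal Python sketch of the loop; every name in it (call_mllm, TOOL_REGISTRY) is a hypothetical stand-in for illustration, not the paper's actual interface.

```python
# Hypothetical sketch of the MMedAgent loop. The function and registry names
# (call_mllm, TOOL_REGISTRY) are illustrative stand-ins, not the paper's API.
import json
from typing import Any, Callable, Dict

TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {}  # name -> tool(image, **args)

def mmedagent_answer(query: str, image: Any, call_mllm: Callable[..., str]) -> str:
    # Step 1: the MLLM reads the query + image and emits a structured plan,
    # e.g. '{"tool": "grounding_dino", "args": {"caption": "lung opacity"}}'.
    decision = json.loads(call_mllm(query=query, image=image, mode="plan"))

    # Step 2: run the selected specialist tool, if the plan names one.
    tool_result = None
    tool = TOOL_REGISTRY.get(decision.get("tool", ""))
    if tool is not None:
        tool_result = tool(image, **decision.get("args", {}))

    # Steps 3-4: the MLLM aggregates the tool output into the final answer.
    return call_mllm(query=query, image=image, tool_result=tool_result, mode="respond")
```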

Instruction Tuning Dataset

To enable MMedAgent to make informed tool selections, the researchers curated a robust instruction-tuning dataset. This dataset encompasses instruction-based interactions covering various modalities and tasks. Notably, it includes six distinct medical tools that can solve seven different tasks, allowing MMedAgent to assist proficiently in grounding, segmentation, classification, report generation, and retrieval-augmented generation (Figure 2).

Figure 2: An example of MMedAgent's training data, in which the agent learns to call Grounding DINO for object detection and then answer the user's question.
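
A single training instance plausibly pairs a user turn with an assistant turn that first emits a tool call and then a grounded answer. The record below is an assumed illustration in a LLaVA-Plus-style tool-use format; the exact field names of the released dataset may differ.

```python
# Assumed shape of one instruction-tuning record; the "thoughts/actions/value"
# fields follow the LLaVA-Plus-style tool-use convention and are not
# guaranteed to match the released dataset exactly.
example = {
    "image": "chest_xray_0421.png",  # hypothetical file name
    "conversations": [
        {"from": "human", "value": "Where is the lung opacity in this X-ray?"},
        {
            "from": "gpt",
            # The agent first states which tool it will call and why...
            "thoughts": "This needs object detection; Grounding DINO fits.",
            "actions": [{"tool": "grounding_dino", "args": {"caption": "lung opacity"}}],
        },
        {
            "from": "gpt",
            # ...then composes the final answer from the tool's boxes.
            "value": "A lung opacity is visible in the lower left lung field.",
        },
    ],
}
```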

Methodology

The core methodology involves training MMedAgent through visual instruction tuning, leveraging tools such as Grounding DINO for object detection. Grounding DINO, fine-tuned for medical applications, exemplifies the integration of open-set object detection frameworks within a medical context, enabling effective disease and organ identification across a spectrum of imaging modalities including MRI, CT, and X-rays.
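
As a rough illustration of the detection step, the snippet below calls a Grounding DINO checkpoint through the inference helpers shipped in the original groundingdino repository; the medically fine-tuned weights file, the input scan, and the prompt are assumptions for this sketch.

```python
# Sketch of open-set detection with a (hypothetically medically fine-tuned)
# Grounding DINO checkpoint, using the reference repo's inference helpers.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config from the repo
    "weights/medical_groundingdino.pth",                # hypothetical fine-tuned weights
)
image_source, image = load_image("ct_abdomen.png")      # hypothetical input scan

# A free-text caption makes the detector open-set: name what to find.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="liver . kidney . tumor",
    box_threshold=0.35,
    text_threshold=0.25,
)
print(list(zip(phrases, logits.tolist())))  # detected terms with confidences
```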

The agent further deploys tools such as MedSAM for segmentation, improving the precision and adaptability of the segmentation step; this is particularly useful for diverse, interactive medical imaging challenges.
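
Because MedSAM keeps SAM's promptable interface, a box from the detection step can be passed straight in as a segmentation prompt. The handoff below is a sketch built on the segment-anything API; loading MedSAM's ViT-B weights through the standard registry, and the checkpoint path, are assumptions.

```python
# Hypothetical detector-to-segmenter handoff: a bounding box from the
# grounding step prompts a MedSAM (SAM-architecture) model for a pixel mask.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="weights/medsam_vit_b.pth")  # assumed path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("ct_abdomen.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Box prompt in XYXY pixel coordinates, e.g. taken from Grounding DINO's output.
box = np.array([120, 80, 310, 260])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask and its confidence
```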

Performance Evaluation

Quantitative analyses reveal MMedAgent's superior performance across multiple medical imaging tasks compared with state-of-the-art models, including LLaVA-Med and RadFM. Accurate tool selection is a notable driver of this performance, yielding top scores on task types such as organ and disease detection, medical image segmentation, and VQA.
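
Tool-selection accuracy can be read as the fraction of test queries for which the agent's emitted tool call names the expected tool; a minimal way to score it, assuming predictions and labels are plain tool-name strings:

```python
# Minimal tool-selection accuracy: exact match between the tool the agent
# chose and the tool the annotation expects (tool names are illustrative).
def tool_selection_accuracy(predicted: list[str], expected: list[str]) -> float:
    assert len(predicted) == len(expected) and expected
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)

print(tool_selection_accuracy(
    ["grounding_dino", "medsam", "chatcad"],
    ["grounding_dino", "medsam", "biomedclip"],
))  # prints 0.666...
```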

Additionally, scalability tests indicate that MMedAgent not only integrates new tools efficiently but also retains proficiency in invoking previously learned tools, demonstrating strong adaptability and potential for continued improvement (Figure 3).

Figure 3: Qualitative comparison between LLaVA-Med and MMedAgent across different tasks. The undesired and desired responses are highlighted in red and green, respectively.

Scalability and Adaptability

The agent's scalability has been validated via its ability to seamlessly incorporate new tools while maintaining performance on established tasks. Such flexibility is pivotal in an ever-evolving field like medical AI, where frequent updates and tool enhancements accelerate progress (Figure 4).

Figure 4: The scalability of MMedAgent.
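
One way to read this plug-in flexibility: if every tool exposes a uniform call signature, adding a new one is a matter of registering it and briefly instruction-tuning the agent on examples that invoke it. The registry pattern below is a hypothetical sketch of such a uniform interface, not the paper's implementation.

```python
# Hypothetical tool registry showing how a new specialist model could be
# plugged in without touching existing tools; all names are illustrative.
from typing import Any, Callable, Dict

ToolFn = Callable[..., Any]  # convention: tool(image, **kwargs) -> result
TOOLS: Dict[str, ToolFn] = {}

def register_tool(name: str) -> Callable[[ToolFn], ToolFn]:
    def deco(fn: ToolFn) -> ToolFn:
        TOOLS[name] = fn
        return fn
    return deco

@register_tool("report_generation")
def generate_report(image: Any, modality: str = "xray"):
    ...  # wrap e.g. a ChatCAD-style report model here

# Adding a brand-new tool later is one decorator plus a short round of
# instruction tuning so the agent learns when to select it.
@register_tool("cell_segmentation")
def segment_cells(image: Any):
    ...
```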

Conclusion

MMedAgent represents a significant advancement in leveraging MLLMs within healthcare, providing a bridge between the generalist capabilities of LLMs and the specialized needs of medical imaging tasks. With its robust instruction-tuning dataset and tool-integration strategy, MMedAgent emerges as a powerful system, surpassing existing methods on various medical tasks while offering seamless adaptability to future tools. The integration of a multi-modal approach with specialized tools paves the way for smart, reliable medical assistance in clinical settings.
