Emergent Mind

Abstract

While recent commercial large language models (LLMs) have shown promising results on medical tasks, their closed-source nature poses significant privacy and security concerns, hindering their widespread use in the medical field. Despite efforts to create open-source models, their limited parameter counts often result in insufficient multi-step reasoning capabilities for solving complex medical problems. To address this, we introduce Meerkat-7B, a novel medical AI system with 7 billion parameters. Meerkat-7B was trained on our new synthetic dataset of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. Our system achieved remarkable accuracy across seven medical benchmarks, surpassing GPT-3.5 by 13.1% and outperforming the previous best 7B models, MediTron-7B and BioMistral-7B, by 13.4% and 9.8%, respectively. Notably, it surpassed the passing threshold of the United States Medical Licensing Examination (USMLE) for the first time for a 7B-parameter model. Additionally, our system offered more detailed free-form responses to clinical queries than existing 7B and 13B models, approaching the performance level of GPT-3.5. This significantly narrows the performance gap with large LMs, showcasing its effectiveness in addressing complex medical challenges.

Meerkat-7B surpasses other 7B models, GPT-3.5, and even MediTron-70B on the MedQA benchmark.

Overview

  • Meerkat-7B, a small language model with 7 billion parameters, marks a significant advancement in medical AI by excelling in diagnostics and decision-making tasks.

  • It surpasses previous models and challenges larger ones like GPT-3.5 by showing exceptional performance across seven medical benchmark datasets.

  • The model’s training includes Chain-of-Thought reasoning from medical textbooks and synthetic data, enhancing its reasoning skills and understanding of complex medical information.

  • Its development offers a scalable and open-source solution for AI-assisted medical diagnostics, narrowing the performance gap between small models and LLMs.

Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks

Introduction to Meerkat-7B

The emergence of Meerkat-7B marks a significant advancement in leveraging smaller-scale models for complex problem-solving in medical AI. Rooted in a novel approach of training on chain-of-thought (CoT) reasoning paths sourced from medical textbooks through synthetic dataset creation, Meerkat-7B pushes the boundaries of what small language models can achieve in medical diagnostics and decision-making tasks. With 7 billion parameters, it not only excels on medical benchmarks but also offers an open alternative to the closed-source commercial LLMs whose use is constrained in sensitive fields such as healthcare.

Exceptional Benchmark Performance

Across a spectrum of seven medical benchmark datasets, Meerkat-7B has demonstrated remarkable proficiency, overshadowing its predecessors and even challenging the supremacy of significantly larger models like GPT-3.5. Specifically, Meerkat-7B achieved an average accuracy of 64.2%, presenting substantial improvements over GPT-3.5, MediTron-7B, and BioMistral-7B, by 13.1%, 13.4%, and 9.8% respectively. The model showcases its prowess in accurately addressing USMLE-style questions, exceeding the passing threshold with scores of 74.3% and 71.4% on MedQA and the USMLE Sample Test—benchmarks previously dominated by larger models.
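As a sanity check on the reported margins, the baseline averages implied by these figures can be recovered with simple arithmetic. This is a hedged sketch: it assumes the stated improvements are absolute differences in average accuracy, and the derived baseline numbers below are inferred from the deltas, not quoted from the paper.

```python
# Meerkat-7B's reported average accuracy across the seven benchmarks.
meerkat_avg = 64.2

# Reported improvements (accuracy points) over each baseline model.
improvements = {"GPT-3.5": 13.1, "MediTron-7B": 13.4, "BioMistral-7B": 9.8}

# Implied baseline averages, assuming the deltas are absolute differences.
baseline_avgs = {
    name: round(meerkat_avg - delta, 1) for name, delta in improvements.items()
}
print(baseline_avgs)
# {'GPT-3.5': 51.1, 'MediTron-7B': 50.8, 'BioMistral-7B': 54.4}
```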

Training with CoT and Synthetic Data

A cornerstone of Meerkat-7B's training regimen is the integration of CoT reasoning paths drawn from 18 medical textbooks. This approach enriches the model's understanding and reasoning skills and strengthens its capability to process and interpret complex medical information. Fine-tuning on both MedQA-CoT and the newly introduced MedBooks-CoT-18 dataset, a synthetic collection of question-answer pairs with reasoning paths derived from those textbooks, significantly enhanced the model's proficiency. This method of synthesizing data from authoritative sources has proven instrumental in bridging the gap between small models and their larger counterparts.
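The paper's exact data-synthesis pipeline is not reproduced here, but its general shape, prompting a teacher model to turn each textbook passage into a question with step-by-step reasoning and an answer, can be sketched as follows. The `ask_teacher_model` callable is a hypothetical stand-in for whatever LLM API is used, and the prompt wording is illustrative, not the paper's.

```python
import json

def build_cot_prompt(passage: str) -> str:
    """Illustrative prompt asking a teacher LLM to convert a textbook
    passage into a USMLE-style question with step-by-step reasoning."""
    return (
        "Read the following medical textbook passage and write a "
        "USMLE-style multiple-choice question based on it. Then answer "
        "it with step-by-step reasoning.\n\n"
        f"Passage:\n{passage}\n\n"
        "Return JSON with keys: question, options, reasoning, answer."
    )

def synthesize_cot_example(passage: str, ask_teacher_model) -> dict:
    """Produce one (question, chain-of-thought, answer) training example.

    `ask_teacher_model` is a hypothetical callable (prompt -> JSON string)
    standing in for a commercial LLM API.
    """
    raw = ask_teacher_model(build_cot_prompt(passage))
    example = json.loads(raw)
    # Keep only well-formed examples containing all expected fields.
    required = {"question", "options", "reasoning", "answer"}
    if not required.issubset(example):
        raise ValueError("teacher output missing required fields")
    return example
```

In practice such a pipeline would add filtering for factual and formatting quality before the examples are used for fine-tuning; the sketch above only checks structural completeness.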

Implications and Future Prospects

The development and open-source availability of Meerkat-7B present promising avenues for advancing AI-assisted medical diagnostics and decision-making processes. By achieving superior performance in medical benchmarks, the model narrows the existing performance gap between small-scale models and LLMs, offering a viable alternative for applications requiring sensitive data handling. The innovative approach of leveraging CoT reasoning and synthetic dataset generation from textbooks exemplifies a scalable strategy for enhancing small models' capabilities. As Meerkat-7B continues to evolve, its integration with preference alignment techniques and the development of retrieval-augmented methodologies stand as potential future directions to further refine its performance and reliability in real-world medical applications.

In conclusion, Meerkat-7B epitomizes a significant stride towards democratizing access to high-performing medical AI systems, underscoring the feasibility of small models in tackling intricate challenges that were once the sole domain of LLMs. This breakthrough heralds a new era for AI in medicine, emphasizing the importance of innovative data utilization and model training strategies.
