ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks (2407.18525v2)
Abstract: Large language models (LLMs) are increasingly deployed in medicine. However, their utility in non-generative clinical prediction, where they are often presumed inferior to specialized models, remains under-evaluated; the absence of systematic benchmarking has fueled ongoing debate in the field and created potential for misuse, misunderstanding, or over-reliance. Our ClinicRealm study addresses this gap by benchmarking 9 GPT-based LLMs, 5 BERT-based models, and 7 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHRs). Key findings reveal a significant shift: for predictions from clinical notes, leading LLMs (e.g., DeepSeek R1/V3, GPT o3-mini-high) in zero-shot settings now decisively outperform fine-tuned BERT models. On structured EHRs, while specialized models excel when training data is ample, advanced LLMs (e.g., GPT-4o, DeepSeek R1/V3) show potent zero-shot capabilities and often surpass conventional models in data-scarce settings. Notably, leading open-source LLMs can match or exceed their proprietary counterparts. These results establish modern LLMs as powerful non-generative clinical prediction tools, particularly for unstructured text, while offering data-efficient options for structured data, and thus necessitate a re-evaluation of model selection strategies. This research should offer important insights to medical informaticists, AI developers, and clinical researchers, prompting a reassessment of current assumptions and inspiring new approaches to applying LLMs in predictive healthcare.
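For context, the sketch below illustrates what zero-shot non-generative prediction from a clinical note can look like in practice: the LLM is prompted to emit a constrained label with no task-specific fine-tuning, in contrast to a BERT baseline that must be trained on labeled examples. The model name, prompt wording, label scheme, and example note are illustrative assumptions, not the study's published protocol.

```python
# Minimal sketch: zero-shot in-hospital mortality risk prediction from an
# unstructured clinical note. The prompt, model, and note are hypothetical
# stand-ins; the paper's exact evaluation setup may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def predict_mortality_risk(note: str) -> str:
    """Ask a chat LLM for a binary risk label; no fine-tuning or labeled data."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # a capable stand-in; not necessarily the study's config
        temperature=0,   # deterministic output simplifies benchmark scoring
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a clinical risk assessor. Answer with exactly "
                    "one word: HIGH or LOW in-hospital mortality risk."
                ),
            },
            {"role": "user", "content": f"Clinical note:\n{note}"},
        ],
    )
    return resp.choices[0].message.content.strip()


if __name__ == "__main__":
    example_note = "78M admitted with septic shock, on vasopressors, Cr 3.1."
    print(predict_mortality_risk(example_note))  # e.g., "HIGH"
```

The key design contrast with the fine-tuned baselines is data efficiency: the prompt above consumes zero labeled training examples, which is what makes the zero-shot comparisons in the abstract meaningful in data-scarce settings.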