Emergent Mind

Abstract

The increasing use of tools and solutions based on LLMs for various tasks in the medical domain has become a prominent trend. Their use in this highly critical and sensitive domain has thus raised important questions about their robustness, especially in response to variations in input, and the reliability of the generated outputs. This study addresses these questions by constructing a textual dataset based on the ICD-10-CM code descriptions, widely used in US hospitals and containing many clinical terms, and their easily reproducible rephrasing. We then benchmarked existing embedding models, either generalist or specialized in the clinical domain, in a semantic search task where the goal was to correctly match the rephrased text to the original description. Our results showed that generalist models performed better than clinical models, suggesting that existing clinical specialized models are more sensitive to small changes in input that confuse them. The highlighted problem of specialized models may be due to the fact that they have not been trained on sufficient data, and in particular on datasets that are not diverse enough to have a reliable global language understanding, which is still necessary for accurate handling of medical documents.

Overview

  • The paper compares generalist embedding models with clinically specialized ones on short-context clinical semantic search.

  • Model performance was tested on a dataset of 100 ICD-10-CM codes, each with its main description and ten varied rephrasings.

  • Generalist models performed better in matching rephrased queries to the correct ICD-10-CM codes, with jina-embeddings-v2-base-en leading at an 84.0% exact match rate.

  • Findings suggest that broad training across different domains makes generalist models more versatile in handling language variation in clinical settings.

  • The study suggests that, for this kind of clinical task, general language understanding can matter more than specialized knowledge.

Introduction

In medical informatics, embedding models are fundamental tools for semantic search, the retrieval of clinical information from large datasets. These models convert text into numerical vectors, which can then be compared to find the most similar pieces of content. The study summarized here compares generalist embedding models with models specialized for clinical text, examining their performance in semantic search tasks built on clinical diagnostic information from ICD-10-CM codes.
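The vector comparison described above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the comparison measure (the summary does not say which measure the study used) and using made-up three-dimensional vectors in place of real model embeddings. The function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, corpus_vecs):
    """Return the index of the corpus vector closest to the query."""
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vectors standing in for embeddings (illustrative values only).
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
best = most_similar(query, corpus)  # index of the nearest corpus vector
```

In a real pipeline the vectors would come from an embedding model and have hundreds of dimensions, but the nearest-neighbour logic stays the same.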

Methodology and Dataset

The ICD-10-CM codes, the standard for coding diagnoses in U.S. hospitals, provided the foundation for this study. The dataset consists of 100 ICD-10-CM codes, each with a main description and ten reformulated phrases intended to simulate the varied wording found in real medical documents. The rephrasings were generated with ChatGPT 3.5 turbo and were deliberately varied away from the original descriptions. Each selected model was then evaluated in a semantic search task in which the rephrasings served as queries to be matched to the correct ICD-10-CM code description.
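The evaluation loop described above might look roughly like the following sketch. Several things here are assumptions for illustration only: `embed` is a toy bag-of-words stand-in rather than one of the benchmarked models, the rephrasings are invented rather than taken from the study's dataset, and only three codes are shown instead of 100.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# ICD-10-CM codes with their main descriptions (the search corpus).
DESCRIPTIONS = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "J45.909": "Unspecified asthma, uncomplicated",
}

# Hypothetical rephrasings used as queries: (true code, query text).
REPHRASINGS = [
    ("E11.9", "Diabetes mellitus type 2 with no complications"),
    ("I10", "Primary hypertension"),
    ("J45.909", "Uncomplicated asthma, type unspecified"),
]

def search(query, descriptions):
    """Return the code whose description is most similar to the query."""
    q = embed(query)
    return max(descriptions, key=lambda code: cosine(q, embed(descriptions[code])))

# Pairs of (true code, predicted code), ready for scoring.
predictions = [(true, search(text, DESCRIPTIONS)) for true, text in REPHRASINGS]
```

A word-overlap embedding like this one fails on a query such as "high blood pressure" for I10, because no words are shared with the description. Handling that kind of rewording is what the benchmarked embedding models are being tested on.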

Two central conditions governed the choice of models: the requirement for CPU-only operability for widespread accessibility and cost-effectiveness, and the preference for free and commonly used models from established repositories.

Results

Generalist models such as jina-embeddings-v2-base-en outperformed their specialized counterparts by significant margins on all three metrics: exact matching, category matching, and character error rate (CER). The leading generalist model reached an exact matching rate of 84.0%, well above the 64.4% of ClinicalBERT, the top-performing specialized model. Although clinical embedding models are tuned for medical terminology, the generalist models, with their exposure to broader language, proved more robust to variations in clinical wording.
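The three metrics can be computed from pairs of true and predicted codes, as in the sketch below. The definitions are assumptions, since this summary does not spell them out: category matching is taken to mean agreement on the first three characters of the code (the ICD-10-CM category), and CER is taken to be the Levenshtein distance between predicted and true code divided by the length of the true code. The example pairs are invented and are not results from the study.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def evaluate(pairs):
    """Score (true_code, predicted_code) pairs under the assumed definitions."""
    n = len(pairs)
    exact = sum(t == p for t, p in pairs) / n
    # Assumption: the first three characters identify the ICD-10-CM category.
    category = sum(t[:3] == p[:3] for t, p in pairs) / n
    # Assumption: CER is edit distance normalized by the true code's length.
    cer = sum(levenshtein(t, p) / len(t) for t, p in pairs) / n
    return {"exact": exact, "category": category, "cer": cer}

# Invented example pairs: (true code, predicted code).
pairs = [
    ("E11.9", "E11.9"),      # exact match
    ("E11.9", "E11.8"),      # same category, wrong code
    ("I10", "I11.9"),        # different category
    ("J45.909", "J45.909"),  # exact match
]
metrics = evaluate(pairs)
```

Under these definitions a near miss within the same category still counts toward category matching and adds only a small amount to CER, which is why the three metrics are more informative together than exact matching alone.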

Conclusion and Implications

The comparison indicates that generalist models are better suited to short-context clinical semantic search than their specialized counterparts. The breadth of their training data, including non-medical content, appears to give these models the versatility needed to handle the varied language found in healthcare settings. The findings fit into current discussions of LLM utility in clinical applications, suggesting that for certain tasks a robust general language understanding may be more valuable than specialized knowledge. Future research could extend the evaluation to longer contexts, such as full-length medical documents, or benchmark newer, more advanced models. Overall, the study suggests that refining LLMs for medical use may depend on their ability to handle a diverse range of human language.
