Generalist embedding models are better at short-context clinical semantic search than specialized embedding models (2401.01943v2)

Published 3 Jan 2024 in cs.CL and cs.AI

Abstract: The increasing use of tools and solutions based on LLMs for various tasks in the medical domain has become a prominent trend. Their use in this highly critical and sensitive domain has thus raised important questions about their robustness, especially in response to variations in input, and the reliability of the generated outputs. This study addresses these questions by constructing a textual dataset based on the ICD-10-CM code descriptions, widely used in US hospitals and containing many clinical terms, and their easily reproducible rephrasing. We then benchmarked existing embedding models, either generalist or specialized in the clinical domain, in a semantic search task where the goal was to correctly match the rephrased text to the original description. Our results showed that generalist models performed better than clinical models, suggesting that existing clinical specialized models are more sensitive to small changes in input that confuse them. The highlighted problem of specialized models may be due to the fact that they have not been trained on sufficient data, and in particular on datasets that are not diverse enough to have a reliable global language understanding, which is still necessary for accurate handling of medical documents.

Summary

The paper finds that the best generalist embedding model reaches 84.0% exact matching in clinical semantic search, compared with 64.4% for the best specialized model.
It employs a dataset of 100 ICD-10-CM codes with rephrased descriptions to simulate diverse clinical language.
The study underscores that broad language training enables models to better navigate nuanced, short-context clinical text.

Introduction

In medical informatics, embedding models are fundamental tools for semantic search, the retrieval of clinical information from large datasets. Such models convert text into numerical vectors, which can be compared to find the most similar pieces of content. This study compares generalist embedding models with models specialized for the clinical domain, examining their performance on a semantic search task built from ICD-10-CM diagnostic code descriptions.
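
As a rough illustration of how vectors "can be compared", the sketch below uses cosine similarity, a common choice for embedding search. This summary does not state which similarity measure the study used, so treat it as an assumption. The three-dimensional vectors are made up for the example and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query_vec, corpus_vecs):
    """Return the index of the corpus vector most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy "embeddings" invented for illustration; real models produce
# vectors with hundreds of dimensions.
corpus = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6], [0.1, 0.0, 0.95]]
query = [0.05, 0.7, 0.7]
best = most_similar(query, corpus)
```

Here `best` is the position of the corpus entry whose direction is closest to the query. In a real system, that index would map back to a document or code description.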

Methodology and Dataset

The ICD-10-CM codes, widely used in U.S. hospitals for coding diagnoses, provided the foundation for this study. The dataset consists of 100 ICD-10-CM codes, each with its original description and ten reformulated phrases intended to simulate how varied wording can appear in real medical documents. The rephrasings were generated with the LLM ChatGPT 3.5 turbo and deliberately diverge from the original wording. Each selected model was then tested by using these rephrasings as queries in a semantic search task, where the goal is to retrieve the matching ICD-10-CM code description.
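
The evaluation setup described above can be sketched as follows. This is a minimal illustration, not the authors' code. The miniature dataset is hypothetical, with invented rephrasings and only three codes rather than 100. The `embed` function is a character-trigram stand-in, used so the example runs without a model download. In the study, that role is played by a neural embedding model.

```python
import math
from collections import Counter

# Hypothetical miniature dataset: code -> (description, rephrasings).
# The study used 100 codes with ten rephrasings each; these are invented.
dataset = {
    "E11.9": ("Type 2 diabetes mellitus without complications",
              ["Type 2 diabetes with no complications",
               "Uncomplicated type 2 diabetes mellitus"]),
    "I10": ("Essential (primary) hypertension",
            ["Primary hypertension",
             "Hypertension, essential type"]),
    "J45.909": ("Unspecified asthma, uncomplicated",
                ["Uncomplicated asthma, type unspecified",
                 "Asthma, unspecified, without complications"]),
}

def embed(text):
    """Stand-in embedding: character-trigram counts (NOT a neural model)."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, index):
    """Return the code whose description embedding is closest to the query."""
    q = embed(query)
    return max(index, key=lambda code: cosine(q, index[code]))

# Embed every original description once, then use each rephrasing as a query.
index = {code: embed(desc) for code, (desc, _) in dataset.items()}
pairs = [(r, code) for code, (_, rs) in dataset.items() for r in rs]
predictions = [(search(query, index), true_code) for query, true_code in pairs]
```

Each entry of `predictions` pairs the retrieved code with the code the rephrasing was derived from, which is the input that matching metrics need. Swapping `embed` for a real model leaves the rest of the loop unchanged.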

Two central conditions governed the choice of models: the requirement for CPU-only operability for widespread accessibility and cost-effectiveness, and the preference for free and commonly used models from established repositories.

Results

Generalist models such as jina-embeddings-v2-base-en outperformed their specialized counterparts by wide margins on exact matching, category matching, and character error rate (CER). The leading generalist model reached an exact matching rate of 84.0%, well above the 64.4% of the top-performing specialized model, ClinicalBERT. Although clinical embedding models are tuned for medical terminology, the generalist models, with their exposure to broader language, proved more resilient to variations in clinical text.
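
The three metrics can be sketched as below, under stated assumptions, since this summary does not give their exact definitions. Category matching is assumed to compare the first three characters of each code, which form the ICD-10-CM category. CER is assumed to be the Levenshtein distance between predicted and true code strings, divided by the length of the true code. The function names and example pairs are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def exact_match_rate(pairs):
    """Share of queries whose retrieved code equals the true code."""
    return sum(pred == true for pred, true in pairs) / len(pairs)

def category_match_rate(pairs):
    """Share of queries whose retrieved code has the same 3-character category."""
    return sum(pred[:3] == true[:3] for pred, true in pairs) / len(pairs)

def mean_cer(pairs):
    """Mean character error rate; the normalization is an assumption."""
    return sum(levenshtein(pred, true) / len(true) for pred, true in pairs) / len(pairs)

# Invented (predicted, true) code pairs for illustration.
pairs = [("E11.9", "E11.9"), ("E11.8", "E11.9"),
         ("I10", "E11.9"), ("J45.909", "J45.909")]
```

On these made-up pairs, the second prediction misses the exact code but stays in the right category. This shows why category matching is more lenient than exact matching, while CER gives partial credit at the character level.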

Conclusion and Implications

The head-to-head comparison indicates that generalist models are better suited to short-context clinical semantic search than their specialized counterparts. The breadth of their training data, including non-medical content, appears to give these models the versatility to handle the nuanced language found in healthcare settings. The findings are relevant to current discussions of LLM utility in clinical applications, suggesting that for certain tasks, robust general language understanding may be more valuable than specialized knowledge. Future research may explore longer contexts, such as full-length medical documents, or benchmark newer, more advanced models. The research suggests that refining language models for medical use may depend on their ability to handle diverse human language.

Tweets

https://twitter.com/bo_wangbo/status/1743226041190240458

https://twitter.com/BrianRoemmele/status/1743316306903122078

https://twitter.com/gastronomy/status/1743429707758985648

https://twitter.com/arxivsanitybot/status/1743624949024092552

https://twitter.com/HyperMindAI/status/1744125039300018325