
Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data (1804.01486v3)

Published 4 Apr 2018 in cs.CL, cs.AI, and stat.ML

Abstract: Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. In this article, we present a new set of embeddings for medical concepts learned using an extremely large collection of multimodal medical data. Leaning on recent theoretical insights, we demonstrate how an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. To evaluate our approach, we present a new benchmark methodology based on statistical power specifically designed to test embeddings of medical concepts. Our approach, called cui2vec, attains state-of-the-art performance relative to previous methods in most instances. Finally, we provide a downloadable set of pre-trained embeddings for other researchers to use, as well as an online tool for interactive exploration of the cui2vec embeddings.

Citations (170)

Summary

  • The paper introduces cui2vec, a novel strategy for learning clinical concept embeddings from massive multimodal data sources mapped to a unified concept space.
  • Authors validated the cui2vec embeddings using statistical benchmarks, showing superior performance in detecting various medical relationships.
  • The development includes open-source tools and facilitates applications in clinical decision support and semantic interoperability.

Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data

The paper "Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data" introduces cui2vec, an embedding strategy designed to construct clinical concept embeddings from diverse forms of healthcare data. The work addresses the limitations posed by disparate sources and the unstructured nature of medical data by providing a unified framework for representing medical concepts. Its use of massive amounts of multimodal data permits representation learning for an unusually broad range of medical concepts, making it a noteworthy reference point in the field.

The authors combined multiple data sources, including a claims database covering 60 million members, a collection of 20 million clinical notes, and 1.7 million full-text biomedical journal articles, to create embeddings for 108,477 medical concepts. A pivotal aspect of the work is mapping these varied data into a common concept unique identifier (CUI) space using the Unified Medical Language System (UMLS). This mapping allows co-occurrence statistics from the different data modalities to be combined into a unified matrix, from which a single, robust set of concept embeddings is derived.
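The core of such a pipeline can be sketched in a few lines: count how often CUIs co-occur, build a shifted positive PMI matrix, and factorize it with an SVD (the word2vec-as-matrix-factorization view). This is a minimal illustration, not the authors' implementation; the function names are hypothetical, and a realistic version would use sparse matrices and a truncated SVD solver rather than dense NumPy arrays.

```python
import numpy as np

def cooccurrence_matrix(docs, vocab):
    """Count how often pairs of CUIs co-occur within the same document/window.
    `docs` is a list of CUI lists, already mapped to the UMLS concept space."""
    idx = {c: i for i, c in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        cuis = [idx[c] for c in doc if c in idx]
        for i in cuis:
            for j in cuis:
                if i != j:
                    C[i, j] += 1
    return C

def ppmi_svd_embeddings(C, dim=2, shift=1.0):
    """Shifted positive PMI followed by SVD, the matrix-factorization
    view of skip-gram word2vec. Returns one `dim`-d vector per concept."""
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts -> zero PMI
    sppmi = np.maximum(pmi - np.log(shift), 0.0)
    U, S, _ = np.linalg.svd(sppmi)
    return U[:, :dim] * np.sqrt(S[:dim])  # scale by singular values
```

Because every source is first reduced to CUI co-occurrence counts, matrices from claims, notes, and articles can simply be summed before factorization, which is what makes the multimodal combination tractable.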

The methodology adapts established NLP techniques such as word2vec and GloVe to the medical domain. To validate the embeddings, the authors introduce an evaluation strategy based on statistical power rather than conventional ranking metrics: for curated sets of known medical relationships, an embedding is scored by the fraction of related concept pairs whose similarity is significantly higher than that of random pairs. This provides a principled way to test embedding efficacy against real-world clinical relationships while sidestepping the lack of explicit labels, and under these benchmarks cui2vec shows superior performance in most instances.
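A power-style benchmark of this kind can be approximated as follows: draw a null distribution of cosine similarities between random concept pairs, take its 95th percentile as a significance threshold, and report the fraction of known related pairs that exceed it. This is a hedged sketch under simplified assumptions (the function name and CUI labels are hypothetical, and the paper's benchmarks are more careful, e.g. about how null pairs are sampled):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def statistical_power(emb, known_pairs, n_null=10_000, alpha=0.05, seed=0):
    """Fraction of known related pairs whose cosine similarity exceeds the
    (1 - alpha) quantile of similarities between random concept pairs.
    `emb` maps CUI strings to vectors; `known_pairs` is a list of CUI tuples."""
    rng = np.random.default_rng(seed)
    concepts = list(emb)
    # Null distribution: similarities of randomly drawn concept pairs.
    null_sims = []
    for _ in range(n_null):
        a, b = rng.choice(len(concepts), size=2, replace=False)
        null_sims.append(cosine(emb[concepts[a]], emb[concepts[b]]))
    threshold = np.quantile(null_sims, 1 - alpha)
    hits = sum(cosine(emb[a], emb[b]) > threshold for a, b in known_pairs)
    return hits / len(known_pairs)
```

The appeal of this metric is that it yields a probability-like score comparable across relationship types (comorbidity, causation, etc.) and across embedding sets of different dimensionality, unlike raw similarity rankings.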

Across the benchmarks, the cui2vec embeddings attain higher statistical power than previous methods, performing especially well at detecting causative relationships and comorbid conditions. They surpass embeddings derived from single-source datasets while offering a unified representation space. The authors also release an open-source R package and an interactive web-based exploration tool, making the embeddings readily usable by the broader research community.

The implications of this work are substantial. By improving the representation of medical concepts with embeddings learned from large-scale, multimodal data, cui2vec supports applications in clinical decision support, semantic interoperability, and, potentially, personalized medicine. Future work could integrate more diverse datasets, further tune hyperparameters, and apply the embeddings in practical healthcare settings; extending the framework to emerging data types would broaden its applicability to informed decision-making in clinical contexts.

Overall, "Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data" is a significant contribution to medical informatics, providing foundational tools and insights for the developing field of healthcare NLP. The cui2vec embeddings are a potent resource, likely to influence numerous applications of artificial intelligence in medicine, in both theoretical exploration and practical implementation.