Generalist embedding models are better at short-context clinical semantic search than specialized embedding models (2401.01943v2)
Abstract: Tools and solutions based on LLMs are increasingly used for a wide range of tasks in the medical domain. Because this domain is highly critical and sensitive, their use raises important questions about robustness, especially in response to variations in input, and about the reliability of the generated outputs. This study addresses these questions by constructing a textual dataset based on the ICD-10-CM code descriptions, which are widely used in US hospitals and contain many clinical terms, together with easily reproducible rephrasings of those descriptions. We then benchmarked existing embedding models, both generalist and specialized in the clinical domain, on a semantic search task whose goal was to correctly match each rephrased text to its original description. Generalist models outperformed clinical models, suggesting that existing clinically specialized models are more sensitive to small input variations that confuse them. This weakness of specialized models may stem from insufficient training data, and in particular from datasets that are not diverse enough to support the reliable general language understanding that is still necessary for accurately handling medical documents.
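The evaluation described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual pipeline: the `embed` function here is a character-trigram hashing stand-in for a real embedding model (a real benchmark would call a generalist or clinical model, e.g. via Hugging Face), and the example descriptions are invented ICD-10-CM-style strings. The core of the task is unchanged: embed all original descriptions, embed each rephrasing, retrieve the nearest original by cosine similarity, and report top-1 accuracy.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding: hashed bag of character trigrams, unit-normalized.
    # A real run would replace this with a generalist or clinical embedding model.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def top1_accuracy(originals: list[str], rephrasings: list[str]) -> float:
    """Fraction of rephrasings whose nearest original (by cosine) is the correct one."""
    O = np.stack([embed(t) for t in originals])    # (N, dim), rows unit-norm
    R = np.stack([embed(t) for t in rephrasings])  # (N, dim), row i pairs with originals[i]
    sims = R @ O.T                                 # cosine similarity matrix
    preds = sims.argmax(axis=1)                    # nearest original for each rephrasing
    return float((preds == np.arange(len(originals))).mean())

# Invented example descriptions in the style of ICD-10-CM, for illustration only.
originals = [
    "Type 2 diabetes mellitus without complications",
    "Essential (primary) hypertension",
    "Acute upper respiratory infection, unspecified",
]
rephrasings = [
    "Uncomplicated type 2 diabetes mellitus",
    "Primary essential hypertension",
    "Unspecified acute infection of the upper respiratory tract",
]
print(top1_accuracy(originals, rephrasings))
```

The paper's finding can be read directly off this setup: a model whose embeddings are robust to rephrasing keeps each rephrased text closest to its own original, while a brittle model lets small surface changes push the rephrasing toward a different code's description, lowering top-1 accuracy.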