Medical Spoken Named Entity Recognition (2406.13337v2)

Published 19 Jun 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Spoken Named Entity Recognition (NER) aims to extract named entities from speech and categorize them into types such as person, location, and organization. In this work, we present VietMed-NER, the first spoken NER dataset in the medical domain. To the best of our knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. We also present baseline results using various state-of-the-art pre-trained models, both encoder-only and sequence-to-sequence. We found that the pre-trained multilingual model XLM-R outperformed all monolingual models on both reference text and ASR output. In general, encoders also perform better than sequence-to-sequence models for the NER task. With simple translation, the transcripts are applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: https://github.com/leduckhai/MultiMed

Authors (7)

Khai Le-Duc
David Thulke
Hung-Phong Tran
Long Vo-Dang
Khai-Nguyen Nguyen
Truong-Son Hy
Ralf Schlüter

Summary

The paper presents VietMed-NER, the first spoken NER dataset in the medical domain, featuring 18 entity types annotated via Recursive Greedy Mapping.
The study employs a two-stage pipeline with state-of-the-art ASR and NER models, achieving an F1 score of 74.0% on reference texts.
The findings underscore the effectiveness of multilingual models and novel annotation methods in enhancing NER performance for low-resource spoken languages.

Medical Spoken Named Entity Recognition: Overview and Analysis

The paper "Medical Spoken Named Entity Recognition" introduces VietMed-NER, which the authors describe as the first spoken Named Entity Recognition (NER) dataset in the medical domain. It targets medical conversations in Vietnamese, a language that is underrepresented in spoken NER research. With 18 distinct entity types, the dataset is presented as the largest spoken NER dataset in terms of the number of entity types.

Dataset and Methodology

The VietMed-NER dataset is built from real-world audio in the VietMed ASR dataset. It defines 18 medically relevant entity types across 9,000 annotated sentences. The data is divided into training, development, and test sets by duration, a split intended to suit fine-tuning of large pre-trained models. Annotation uses a novel method called "Recursive Greedy Mapping," designed to make annotation more efficient and consistent and to counter data-quality problems seen in other datasets, such as missing entity tags and segmentation errors.

Experimental Setup and Models

The research uses a two-stage pipeline for spoken NER: an Automatic Speech Recognition (ASR) model first transcribes the audio, and an NER model then tags the transcript. For ASR, the paper examines two models pre-trained on extensive Vietnamese data, w2v2-Viet and XLSR-53-Viet, which reached Word Error Rates (WERs) of 29.0% and 28.8%, respectively.
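For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the paper's scoring script; the function name `word_error_rate` and the example sentences are assumptions made here, and any text normalization the authors applied before scoring is not reproduced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words is dropped by the ASR system.
wer = word_error_rate("the patient has a high fever", "the patient has high fever")
print(f"{wer:.3f}")  # 0.167
```

At a WER near 29%, roughly three in ten reference words are affected by an error, so the NER stage in a two-stage pipeline has to work on noticeably noisy input.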

For the NER task, a comparative analysis of multiple state-of-the-art monolingual and multilingual pre-trained models is conducted. This includes models like PhoBERT, ViDeBERTa, XLM-R, and others varying significantly in the volume of pre-training data. XLM-R models, trained on 2.5TB of multilingual data, consistently outperform their monolingual counterparts, demonstrating the advantage of extensive pre-training and multilingual capability in NER tasks.

Results and Performance

On reference text, PhoBERT_base-v2 outperformed smaller monolingual counterparts, likely because it was trained on more data. XLM-R large, a multilingual model with extensive pre-training data, gave the best results on both reference text and ASR output, with an F1 score of 74.0% on reference text. This points to the value of larger multilingual models for spoken NER.
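As a reminder of what an NER F1 score measures, the sketch below computes a micro-averaged, entity-level F1 under the common exact-match convention: a predicted entity counts as correct only if its span and type both match the gold annotation. This is an assumption about the scoring scheme, not a description of the paper's evaluation code. The BIO tag format, the helper names `bio_to_spans` and `entity_f1`, and the entity types `DRUG` and `SYMPTOM` are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (stray I- tags are ignored)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact (span, type) matches across all sentences."""
    true_pos = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = bio_to_spans(gold_tags)
        pred_spans = bio_to_spans(pred_tags)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the model finds one of the two gold entities.
gold = [["B-DRUG", "I-DRUG", "O", "B-SYMPTOM"]]
pred = [["B-DRUG", "I-DRUG", "O", "O"]]
print(f"{entity_f1(gold, pred):.3f}")  # 0.667
```

Under exact matching, an ASR error inside an entity usually makes that entity unrecoverable, which is one reason scores on ASR output tend to fall below scores on reference text.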

Implications and Future Directions

The introduction of the VietMed-NER dataset promises advancements in various medical language processing applications, such as correcting errors in medical ASR outputs or enhancing privacy in speech data mining. Future research directions could include exploring more robust models tailored for low-resource languages like Vietnamese and further refining the annotation methodology for consistency across multilingual datasets.

The findings suggest that multilingual models with sufficient pre-training can substantially improve NER, which points future work toward data diversity and the transfer of model capabilities across languages. Annotation methods such as Recursive Greedy Mapping may also make dataset creation more efficient and reliable, particularly in resource-constrained settings.

GitHub

leduckhai/MultiMed: Multilingual Multitask Multipurpose Medical Speech Recognition
