Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition (2402.18923v1)

Published 29 Feb 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Dysarthria, a common issue among stroke patients, severely impacts speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose to extend a large-scale speech recognition model for inappropriate pause detection in dysarthric speech. To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. First, we treat pause detection as speech recognition, using an automatic speech recognition (ASR) model to convert speech into text with pause tags. According to the newly designed task, we label pause locations at the text level and their appropriateness. We collaborate with speech-language pathologists to establish labeling criteria, ensuring high-quality annotated data. Finally, we extend the ASR model with an inappropriate pause prediction layer for end-to-end inappropriate pause detection. Moreover, we propose a task-tailored metric for evaluating inappropriate pause detection independent of ASR performance. Our experiments show that the proposed method better detects inappropriate pauses in dysarthric speech than baselines. (Inappropriate Pause Error Rate: 14.47%)

Summary

The paper introduces a novel ASR-based method that accurately detects inappropriate pauses in dysarthric speech with a 14.47% error rate.
The methodology integrates a task-specific layer and expert labeling from speech-language pathologists to improve pause detection precision.
Experimental results show robust detection performance across varying dysarthria severities while simultaneously enhancing overall ASR accuracy.

Enhancing Dysarthric Speech Analysis: Inappropriate Pause Detection Using Advanced ASR Models

Introduction to Inappropriate Pause Detection in Dysarthric Speech

Dysarthria, which is common among stroke patients, impairs control of the muscles used for speech and thereby reduces speech intelligibility. The paper explores an approach to automatically detecting and assessing inappropriate pauses in dysarthric speech. The method extends a large-scale speech recognition model with a task-specific layer for detecting these pauses, with the aim of supporting speech-language therapy.

Methodology Overview

Traditional methods mostly detect pauses with amplitude thresholds or forced alignment. This paper instead treats pause detection as a speech recognition problem: an automatic speech recognition (ASR) model emits pauses as distinct tokens within its text output. Key steps in the approach include:

Utilizing an ASR Model for Pause Detection: The ASR model converts speech into text that includes pause tags, so pause detection becomes part of the speech-to-text conversion itself.
Labeling Strategy and Model Architecture: Labeling criteria developed together with speech-language pathologists ensure high-quality annotation of pause locations and their appropriateness. An inappropriate pause prediction layer is added to the ASR model to enable end-to-end detection of inappropriate pauses in dysarthric speech.
Introduction of a Novel Evaluation Metric: A task-tailored metric is proposed to evaluate inappropriate pause detection independently of ASR performance, so that pause detection quality can be judged separately from transcription errors.
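
To make the task design concrete, the following minimal Python sketch shows one way a pause-tagged transcript with appropriateness labels could be represented. The tag name `<pause>`, the `Pause` class, and the `parse_tagged_transcript` helper are hypothetical names chosen for illustration; the paper's actual tag vocabulary and data format are not given in this summary.

```python
from dataclasses import dataclass

# Hypothetical tag name; the paper's actual pause token is not stated here.
PAUSE_TAG = "<pause>"


@dataclass
class Pause:
    after_word: int      # number of words that precede the pause
    inappropriate: bool  # annotator's appropriateness label


def parse_tagged_transcript(text, labels):
    """Split a pause-tagged transcript into words and labeled pauses.

    `labels` holds one boolean per pause tag, in order of appearance
    (True means the pause was judged inappropriate).
    """
    tokens = text.split()
    if tokens.count(PAUSE_TAG) != len(labels):
        raise ValueError("need exactly one label per pause tag")

    words, pauses = [], []
    label_iter = iter(labels)
    for token in tokens:
        if token == PAUSE_TAG:
            pauses.append(Pause(after_word=len(words),
                                inappropriate=next(label_iter)))
        else:
            words.append(token)
    return words, pauses


# Made-up example utterance: the first pause is labeled inappropriate.
words, pauses = parse_tagged_transcript(
    "the weather <pause> is nice today <pause> so we went out",
    labels=[True, False],
)
```

Keeping the pause location (a position in the text) separate from its appropriateness label mirrors the two things the annotation is described as capturing, but the exact storage scheme here is an assumption.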

Experimental Insights

The experiments favor incorporating pause detection directly into the ASR model and highlight several outcomes:

Performance Superiority: The proposed method detects inappropriate pauses in dysarthric speech better than the baseline methods, reaching an Inappropriate Pause Error Rate of 14.47%.
Severity Robustness: Detection performance remains consistent across dysarthria severity levels, which matters for clinical use, where diagnostics and feedback are needed across the full severity spectrum.
ASR Performance Improvement: Incorporating pause detection into the ASR model also improves overall ASR performance. This suggests that modeling a specific characteristic of dysarthric speech, such as inappropriate pauses, can benefit speech recognition more broadly.
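
The idea of scoring pause detection separately from transcription quality can be illustrated with a simplified sketch. This is not the paper's definition of the Inappropriate Pause Error Rate, which this summary does not spell out. As an assumption for illustration, each utterance is reduced to the ordered list of its pause labels (True for inappropriate, False for appropriate), and the reference and predicted lists are compared by edit distance, so word errors never enter the score. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pause_label_error_rate(ref_labels, hyp_labels):
    """Edit distance over pause labels, normalized by reference pause count.

    Illustrative only: words are ignored entirely, so the score does not
    depend on how well the words themselves were recognized.
    """
    if not ref_labels:
        raise ValueError("reference contains no pauses")
    return edit_distance(ref_labels, hyp_labels) / len(ref_labels)


# Made-up example: the reference has three pauses, the prediction misses one.
rate = pause_label_error_rate([True, False, True], [True, True])
```

A limitation of this simplification is that it ignores where in the text each pause occurs; a fuller metric would also need to align pause positions between reference and hypothesis.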

Theoretical and Practical Implications

From a theoretical standpoint, this paper proposes an innovative approach to understanding and analyzing dysarthric speech, spotlighting the integration of pause detection within the ASR framework rather than treating it as a separate or subsequent analysis phase. Practically, it provides a scalable and efficient methodology for enhancing speech-language therapy for dysarthric speakers, with the potential for application across different languages and dialects.
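
As a rough illustration of what integrating pause classification into the ASR model could look like, the sketch below applies a single linear layer with a sigmoid to the hidden state at each pause token. Everything here is an assumption: the summary does not say where the prediction layer attaches or how it is parameterized, and the names `classify_pauses`, `weights`, and `bias`, as well as the toy hidden states, are hypothetical.

```python
import math

# Hypothetical tag name; the paper's actual pause token is not stated here.
PAUSE_TAG = "<pause>"


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify_pauses(tokens, hidden_states, weights, bias, threshold=0.5):
    """Score every pause token with a linear layer over its hidden state.

    Returns a list of (token_index, probability, is_inappropriate) tuples.
    `hidden_states[i]` is assumed to be the model's state for `tokens[i]`.
    """
    results = []
    for idx, (token, state) in enumerate(zip(tokens, hidden_states)):
        if token != PAUSE_TAG:
            continue  # only pause positions are classified
        logit = sum(w * x for w, x in zip(weights, state)) + bias
        prob = sigmoid(logit)
        results.append((idx, prob, prob >= threshold))
    return results


# Toy 2-dimensional hidden states, invented for the example.
tokens = ["the", "<pause>", "cat", "<pause>"]
hidden = [[0.1, 0.2], [2.0, 1.0], [0.0, 0.3], [-1.5, -0.5]]
predictions = classify_pauses(tokens, hidden, weights=[1.0, 1.0], bias=0.0)
```

Because the classifier reads the same states the recognizer produces while decoding, transcription and pause assessment happen in one pass rather than as a separate post-processing stage, which is the contrast the paragraph above draws.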

Future Directions in AI and Speech Language Pathology

Looking ahead, extending and refining the architecture to accommodate decoding strategies beyond those of the specific models tested, such as Whisper, could broaden the applicability of this method. Closer collaboration between AI research and speech-language pathology could also yield more effective tools for diagnosing and treating speech disorders, and ultimately improve therapeutic outcomes for individuals with dysarthria.

In conclusion, the paper takes a step toward integrating automatic speech recognition with speech disorder therapy by making the detection and assessment of inappropriate pauses in dysarthric speech more efficient. It adds to the toolkit available to speech-language pathologists a data-driven approach that can be tailored to patients across the severity spectrum of dysarthria.