Emergent Mind

Abstract

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.

Overview

  • The paper analyzes speech enhancement techniques within the Unsupervised Domain Adaptation of Speech Enhancement (UDASE) task from the 7th CHiME challenge, focusing on objective and subjective assessments.

  • It highlights the challenge speech enhancement models face when transferred from synthetic training conditions to real, noisy environments.

  • The study examines discrepancies between traditional intrusive metrics, such as SDR, and nonintrusive metrics such as DNSMOS, with findings suggesting that intrusive metrics yield more reliable system rankings.

  • Subjective evaluations using the ITU-T P.835 methodology show that all systems reduced background noise, but only one improved overall speech quality.

  • Objective metrics were generally poorly correlated with subjective opinions, except for DNSMOS's background noise assessment, which matched well with human perception.

Objective and Subjective Evaluations in Speech Enhancement

The paper presents a comprehensive analysis of speech enhancement methods evaluated in the Unsupervised Domain Adaptation of Speech Enhancement (UDASE) task of the 7th CHiME challenge. Authors Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, and Daniel Pressnitzer delve into both objective and subjective assessments of systems that participated in the challenge, which focuses on leveraging real-world noisy recordings for unsupervised domain adaptation of speech enhancement models.

The Challenge of Domain Adaptation

In the realm of speech enhancement, models tend to be trained on artificially generated datasets composed of clean speech mixed with noise. The gap between these synthetic conditions and the complexity of real-world environments can lead to degraded performance when a model encounters conditions not reflected in its training data. The UDASE task confronts this challenge, targeting the adaptation of models to real, unlabeled noisy speech data, emulating human auditory adaptability.
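The supervised setup described above can be sketched in a few lines: a training pair is synthesized by scaling a noise signal so that the mixture reaches a target signal-to-noise ratio (SNR). This is a minimal illustration of the general recipe, not the challenge's actual data pipeline; the function name and SNR value are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix clean speech with noise at a target SNR in dB.

    Hypothetical helper illustrating how supervised training pairs
    are typically synthesized (not the challenge's actual pipeline).
    Returns the mixture plus the clean and scaled-noise targets.
    """
    noise = noise[: len(speech)]  # trim noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise, speech, scale * noise
```

A model trained on such pairs sees only this synthetic mixing process, which is precisely why its behavior can degrade on real conversational recordings like CHiME-5.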

Methodologies and Findings

For objective evaluation, the paper discusses results obtained from a range of performance metrics. Notably, the metrics disagree when ranking the speech enhancement systems, underscoring the contrast between traditional intrusive metrics such as the signal-to-distortion ratio (SDR) and nonintrusive metrics such as DNSMOS and TorchAudio-Squim. The authors found that while the nonintrusive measures may suffer from generalization issues, intrusive metrics computed on a synthetic dataset closely mirroring the in-domain data (a dataset they call reverberant LibriCHiME-5) provided consistent and reliable rankings.
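For concreteness, the basic intrusive SDR can be written in a few lines; it requires the ground-truth clean reference, which is exactly what real recordings such as CHiME-5 lack. This is a sketch of the textbook definition only, not the scale-invariant or more elaborate variants that challenge evaluations typically use.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB.

    An intrusive metric: it compares the enhanced estimate against
    the ground-truth clean reference, so it cannot be computed on
    real-world recordings without clean labels.
    """
    distortion = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(distortion ** 2) + eps))
```

The need for a clean reference is what motivates the reverberant LibriCHiME-5 dataset: a synthetic stand-in for the target domain on which intrusive metrics remain computable.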

The subjective evaluation followed the ITU-T P.835 methodology, in which listening tests are designed to capture human perception of speech quality in the presence of background noise. Participants rated the speech signal quality and the background noise intrusiveness, and gave an overall quality rating for each system. This evaluation yielded an insightful result: only one of the evaluated systems managed to improve overall speech quality over the unprocessed noisy speech, even though all systems effectively reduced the background noise.

Correlation Between Objective and Subjective Metrics

The paper inspects the correlation between objective metrics and the subjective listening test results, finding that the majority of nonintrusive objective metrics correlated poorly with subjective opinions. However, the DNSMOS background noise intrusiveness metric showed a strong correlation with the corresponding subjective assessment, suggesting it is a reliable measure for that particular aspect.
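The kind of metric-versus-listener analysis described here boils down to computing correlation coefficients between per-system objective scores and mean subjective ratings. The numbers below are purely illustrative placeholders, not the paper's data:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores for five systems (illustrative only):
# an objective metric value and the mean subjective rating per system.
metric_scores = [3.1, 3.4, 2.8, 3.9, 3.3]
subjective_mos = [2.9, 3.2, 3.0, 3.8, 3.1]

# Pearson captures linear agreement; Spearman captures rank agreement,
# which is what matters when asking "do the metrics order the systems
# the same way listeners do?"
r, _ = pearsonr(metric_scores, subjective_mos)
rho, _ = spearmanr(metric_scores, subjective_mos)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

With only a handful of systems, rank correlations like Spearman's rho are usually the more informative view, since a single outlier system can dominate a linear fit.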

Conclusion

Concluding the study, the authors highlight that while reverberant LibriCHiME-5 can approximate in-domain performance with traditional intrusive metrics, realizing the full potential of unsupervised domain adaptation for speech enhancement, without access to clean speech labels, remains a challenging task. This analysis, shared with the community, serves as a valuable resource for future research in speech enhancement and domain adaptation.

The study's dataset and the JavaScript experimental platform developed for the listening tests are made publicly available, encouraging replication and further research, and emphasizing the collaborative ethos in advancing the field of speech enhancement.
