Generalization Ability of MOS Prediction Networks

Published 6 Oct 2021 in eess.AS | (2110.02635v3)

Abstract: Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction including MOSNet and self-supervised speech models such as wav2vec2, we investigate their performance on data from different listening tests in both zero-shot and fine-tuned settings. We find that wav2vec2 models fine-tuned for MOS prediction have good generalization capability to out-of-domain data even for the most challenging case of utterance-level predictions in the zero-shot setting, and that fine-tuning to in-domain data can improve predictions. We also observe that unseen systems are especially challenging for MOS prediction models.




Summary

Fine-tuned self-supervised models such as wav2vec2 predict MOS with strong correlation to listener ratings, including on out-of-domain data in the zero-shot setting.
Unseen systems are especially hard to predict, and the difficulty is most visible on data where individual utterances often have only a single rating.
Augmenting the training data with speed and silence transformations improves MOSNet-based models on data from different listening tests.


The paper "Generalization ability of MOS prediction networks" studies automatic prediction of mean opinion scores (MOS) for synthesized speech. Because listeners, evaluated systems, instructions, and rating scales vary from one listening test to the next, a predictor that is accurate within one test often fails on another, and robust automatic MOS prediction remains an unsolved problem. The authors investigate how well several network architectures trained for MOS prediction generalize across listening test contexts.

Key Contributions

The study compares several models, including MOSNet and self-supervised speech models such as wav2vec2, on their ability to predict MOS under different conditions, in particular on out-of-domain data. To test generalization, the authors use datasets from different listening tests, some of which contain speakers, systems, listeners, and texts not seen in training.

Experimental Methodology

The authors train and fine-tune several models on a large in-domain dataset (BVCC) that collects samples from a variety of existing speech synthesis systems. They then test the models on out-of-domain datasets from earlier listening tests, which differ in language, sample diversity, and listener demographics. Evaluation uses four metrics: mean squared error (MSE), linear correlation coefficient (LCC), Spearman rank correlation coefficient (SRCC), and Kendall Tau rank correlation (KTAU).
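The four metrics named above can be computed per utterance or, after averaging scores within each system, per system. The sketch below is a minimal pure-Python illustration and not the authors' evaluation script: the function names and toy scores are hypothetical, and Kendall's tau is computed in its tau-b form, which is an assumption about how ties are handled.

```python
from collections import defaultdict
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Linear (Pearson) correlation coefficient; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = sqrt(vx * vy)
    return cov / denom if denom > 0 else float("nan")


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, computed by comparing every pair of items (O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")


def system_means(system_ids, scores):
    """Average utterance-level scores within each system."""
    sums, counts = defaultdict(float), defaultdict(int)
    for sid, score in zip(system_ids, scores):
        sums[sid] += score
        counts[sid] += 1
    return {sid: sums[sid] / counts[sid] for sid in sums}


def evaluate(y_true, y_pred):
    """Return the four metrics for one pair of score lists."""
    return {
        "MSE": mse(y_true, y_pred),
        "LCC": pearson(y_true, y_pred),
        "SRCC": spearman(y_true, y_pred),
        "KTAU": kendall_tau_b(y_true, y_pred),
    }


if __name__ == "__main__":
    # Toy, made-up data: system id, listener MOS, predicted MOS per utterance.
    systems = ["A", "A", "B", "B", "C", "C"]
    true_mos = [4.0, 4.5, 2.0, 3.0, 3.5, 3.0]
    pred_mos = [4.2, 4.1, 2.5, 2.4, 3.6, 3.3]
    print("utterance level:", evaluate(true_mos, pred_mos))
    t, p = system_means(systems, true_mos), system_means(systems, pred_mos)
    keys = sorted(t)
    print("system level:", evaluate([t[k] for k in keys], [p[k] for k in keys]))
```

Averaging per system before scoring smooths out rating noise on individual utterances, which is one reason utterance-level prediction is described as the most challenging case.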

Significant Findings

Model Performance: Fine-tuned self-supervised models (wav2vec2 and HuBERT) demonstrated strong performance in the MOS prediction task. Notably, wav2vec2 models exhibited good generalization capabilities and strong correlation metrics even in zero-shot scenarios, with the best results when fine-tuned on in-domain data.
Challenges with Unseen Systems: Systems not seen during training were difficult to predict across the datasets. The problem was most visible on the ASV2019 dataset, where individual utterances often have only a single rating, so the target scores themselves are highly variable.
Data Augmentation: The paper reports improvements when augmenting data with speed and silence transformations during model training, particularly for the MOSNet-based architectures.
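This summary does not say how the speed and silence transformations were implemented, so the following is only a rough sketch of what such augmentation could look like on a waveform stored as a list of float samples. All function names, the speed factors, and the silence threshold are hypothetical choices. The speed change here is naive linear-interpolation resampling, which also shifts pitch, and "silence" is read as either padding or trimming at the edges of the utterance.

```python
def change_speed(samples, factor):
    """Resample by linear interpolation so the audio plays `factor` times faster.

    Assumption: a naive resampler that changes pitch along with duration.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor          # fractional position in the input signal
        lo = int(pos)
        if lo >= len(samples) - 1:
            out.append(samples[-1])  # past the last interval: hold final sample
        else:
            frac = pos - lo
            out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return out


def pad_silence(samples, sample_rate, lead_s=0.0, trail_s=0.0):
    """Add zero-valued samples before and after the utterance."""
    lead = int(round(lead_s * sample_rate))
    trail = int(round(trail_s * sample_rate))
    return [0.0] * lead + list(samples) + [0.0] * trail


def trim_silence(samples, threshold=1e-3):
    """Drop leading and trailing samples whose magnitude is at most `threshold`."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) <= threshold:
        start += 1
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    return list(samples[start:end])


def augment(samples, sample_rate):
    """Return several augmented copies of one utterance, keyed by a label.

    Each copy would keep the original utterance's MOS label during training.
    """
    return {
        "slow_0.9": change_speed(samples, 0.9),
        "fast_1.1": change_speed(samples, 1.1),
        "padded": pad_silence(samples, sample_rate, lead_s=0.1, trail_s=0.1),
        "trimmed": trim_silence(samples),
    }
```

Reusing the original MOS label for every augmented copy assumes these transformations leave perceived quality roughly unchanged, which only holds for mild perturbations.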

Implications and Future Directions

The implications of this research are twofold. Practically, it shows that fine-tuning self-supervised models on smaller, task-specific datasets can yield robust MOS predictions, which could reduce the effort of evaluating speech synthesis systems. Theoretically, it provides a basis for further work on model architectures and datasets that better capture how listeners perceive synthesized speech.

Future work could address the difficulty of predicting MOS for unseen systems, for example through more sophisticated modeling techniques or additional linguistic and contextual features. Better domain adaptation strategies could also improve generalization to new listening test contexts.

Overall, the paper clarifies how MOS prediction networks can be trained to generalize better and gives later work a baseline to build on.
