Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement (2312.08979v1)
Abstract: Neural network based approaches to speech enhancement have been shown to be particularly powerful, leveraging a data-driven approach to achieve significant performance gains over other methods. Such approaches rely on artificially created labelled training data, so that the neural model can be trained using intrusive loss functions which compare the output of the model with clean reference speech. However, the performance of such systems when enhancing real-world audio often suffers relative to their performance on simulated test data. In this work, a non-intrusive multi-metric prediction approach is introduced, wherein a model trained on artificially created labelled data is guided by the inference output of an adversarially trained metric prediction neural network. The proposed approach shows improved performance versus state-of-the-art systems on the evaluation sets of the recent CHiME-7 UDASE (unsupervised domain adaptation for conversational speech enhancement) task.
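The core idea of using an adversarially trained metric predictor as a non-intrusive training signal can be illustrated with a minimal sketch of MetricGAN-style losses. All function names here are illustrative assumptions, not the paper's actual implementation: a metric-prediction "discriminator" estimates normalised quality scores from the enhanced signal alone, and the enhancement "generator" is trained to push those predicted scores toward the maximum, so no clean reference is needed at that stage.

```python
import numpy as np

def generator_loss(predicted_scores: np.ndarray, target: float = 1.0) -> float:
    """MetricGAN-style generator objective (illustrative sketch):
    MSE between the discriminator's predicted metric scores for the
    enhanced speech and the best achievable (normalised) score."""
    return float(np.mean((predicted_scores - target) ** 2))

def discriminator_loss(pred_enhanced: np.ndarray,
                       true_enhanced: np.ndarray,
                       pred_clean: np.ndarray) -> float:
    """Discriminator objective (illustrative sketch): learn to match the
    true normalised metric of enhanced speech, and to assign the maximum
    score (1.0) to clean speech."""
    return float(np.mean((pred_enhanced - true_enhanced) ** 2)
                 + np.mean((pred_clean - 1.0) ** 2))
```

In a multi-metric variant, the discriminator would output one score per target metric and the generator loss would sum (or weight) the per-metric terms; the sketch above shows only the single-metric case.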