The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions (2307.14502v2)

Published 27 Jul 2023 in eess.AS, cs.LG, and cs.SD

Abstract: Recent work in the field of speech enhancement (SE) has involved the use of self-supervised speech representations (SSSRs) as feature transformations in loss functions. However, prior work has paid very little attention to the relationship between the language of the audio used to train the self-supervised representation and that used to train the SE system. Enhancement models trained with a loss function incorporating a self-supervised representation whose training language exactly matches that of the noisy SE training data perform better than those where the languages do not match. This may lead to enhancement systems that are language specific and thus do not generalise well to unseen languages, unlike models trained using traditional spectrogram or time-domain loss functions. In this work, SE models are trained and tested on a number of different languages, with self-supervised representations which are themselves trained on different language combinations and with differing network structures, used as loss function representations. These models are then tested across unseen languages and their performances are analysed. It is found that the training language of the self-supervised representation appears to have only a minor effect on enhancement performance; the amount of training data in a particular language, however, greatly affects performance.
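To make the loss construction concrete, the sketch below shows one plausible form of an SSSR-based loss function: a frozen self-supervised encoder (wav2vec 2.0 via HuggingFace Transformers is assumed here purely for illustration) maps both the enhanced and clean waveforms to feature sequences, and the loss is the L1 distance between them. This is a minimal sketch under those assumptions, not the authors' exact implementation; the checkpoint name, distance metric, and loss weighting are placeholders.

```python
# Minimal sketch of an SSSR-based feature loss for speech enhancement.
# Assumptions (not taken from the paper): wav2vec 2.0 base checkpoint,
# L1 distance between feature sequences, 16 kHz mono input waveforms.
import torch
from transformers import Wav2Vec2Model

# Frozen SSSR used only as a feature transformation inside the loss;
# gradients still flow through it back to the enhancement model.
sssr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
sssr.eval()
for p in sssr.parameters():
    p.requires_grad_(False)


def sssr_feature_loss(enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """L1 distance between SSSR features of enhanced and clean speech.

    Both inputs are mono waveforms of shape (batch, samples) at 16 kHz.
    """
    feats_enh = sssr(enhanced).last_hidden_state  # (batch, frames, dim)
    with torch.no_grad():  # clean reference needs no gradient
        feats_clean = sssr(clean).last_hidden_state
    return torch.nn.functional.l1_loss(feats_enh, feats_clean)


# Usage sketch: combine with a conventional spectrogram or time-domain loss
# when training the enhancement model, e.g.
#   loss = time_domain_loss + alpha * sssr_feature_loss(enhanced, clean)
# where alpha is a hypothetical weighting hyperparameter.
```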
