Remixed2Remixed: Domain adaptation for speech enhancement by Noise2Noise learning with Remixing (2312.16836v1)
Abstract: This paper proposes Remixed2Remixed, a domain adaptation method for speech enhancement, which adopts Noise2Noise (N2N) learning to adapt models trained on artificially generated (out-of-domain: OOD) noisy-clean pair data to better separate real-world recorded (in-domain) noisy data. The proposed method uses a teacher model trained on OOD data to acquire pseudo-in-domain speech and noise signals, which are shuffled and remixed twice in each batch to generate two bootstrapped mixtures. The student model is then trained by optimizing an N2N-based cost function computed using these two bootstrapped mixtures. As the training strategy is similar to the recently proposed RemixIT, we also investigate the effectiveness of the N2N-based loss as a regularization of RemixIT. Experimental results on the CHiME-7 unsupervised domain adaptation for conversational speech enhancement (UDASE) task revealed that the proposed method outperformed the challenge baseline system, RemixIT, and reduced the performance variability caused by different teacher models.
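The abstract describes a concrete training loop: the teacher separates each in-domain mixture into pseudo-speech and pseudo-noise, the batch is remixed twice, and the student is trained with an N2N-style loss between the two remixes. The PyTorch sketch below illustrates one possible reading of that loop; it is not the authors' implementation. It assumes the two bootstrapped mixtures pair the same pseudo-speech with two independently permuted pseudo-noise signals, and that the N2N-based cost compares the student's estimate from one remix against the other. All names (`teacher`, `student`, `remixed2remixed_step`) and the choice of L1 regression loss are hypothetical placeholders.

```python
# Minimal sketch of one Remixed2Remixed-style training step (assumptions above).
import torch
import torch.nn.functional as F

def remixed2remixed_step(teacher, student, in_domain_mixtures, optimizer):
    """One training step on a batch of real-world (in-domain) noisy mixtures.

    Assumed interface: both models map a mixture of shape (batch, time) to a
    pair (speech_estimate, noise_estimate) of the same shape.
    """
    with torch.no_grad():
        # 1) Teacher (trained on OOD noisy-clean pairs) provides pseudo-targets.
        pseudo_speech, pseudo_noise = teacher(in_domain_mixtures)

        # 2) Shuffle the pseudo-noise across the batch twice and remix,
        #    yielding two bootstrapped mixtures that share the same pseudo-speech.
        b = pseudo_noise.shape[0]
        perm1, perm2 = torch.randperm(b), torch.randperm(b)
        remix1 = pseudo_speech + pseudo_noise[perm1]
        remix2 = pseudo_speech + pseudo_noise[perm2]

    # 3) Noise2Noise-style objective: the student denoises one bootstrapped
    #    mixture and is supervised by the other (a second noisy view of the
    #    same pseudo-speech), so no clean in-domain target is required.
    speech_est, _ = student(remix1)
    loss = F.l1_loss(speech_est, remix2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the target is itself a noisy mixture whose expectation (over the noise shuffling) is the pseudo-speech, the regression loss behaves like Noise2Noise learning rather than direct distillation of the teacher's output.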
- P. Ochieng, “Deep neural network techniques for monaural speech enhancement: State of the art analysis,” arXiv preprint arXiv:2212.00369, 2022.
- C. Macartney and T. Weyde, “Improved speech enhancement with the Wave-U-Net,” arXiv preprint arXiv:1811.11307, 2018.
- A. Défossez, G. Synnaeve, and Y. Adi, “Real Time Speech Enhancement in the Waveform Domain,” in Proc. Interspeech, pp. 3291–3295, 2020.
- Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Trans. ASLP, vol. 27, no. 8, pp. 1256–1266, 2019.
- E. Tzinis, Z. Wang, and P. Smaragdis, “Sudo rm -rf: Efficient networks for universal audio source separation,” in Proc. MLSP, pp. 1–6, 2020.
- S. Zhao, T. H. Nguyen, and B. Ma, “Monaural speech enhancement with complex convolutional block attention module and joint time frequency losses,” in Proc. ICASSP, pp. 6648–6652, 2021.
- N. Ito and M. Sugiyama, “Audio Signal Enhancement with Learning from Positive and Unlabeled Data,” in Proc. ICASSP, pp. 1–5, 2023.
- A. S. Subramanian, X. Wang, M. K. Baskar, S. Watanabe, T. Taniguchi, D. Tran, and Y. Fujita, “Speech enhancement using end-to-end speech recognition objectives,” in Proc. WASPAA, pp. 234–238, 2019.
- S. W. Fu, C. Yu, K. H. Hung, M. Ravanelli, and Y. Tsao, “MetricGAN-U: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech,” in Proc. ICASSP, pp. 7412–7416, 2022.
- S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. Hershey, “Unsupervised sound separation using mixture invariant training,” in Proc. Adv. NIPS, 33, pp. 3846–3857, 2020.
- K. Saijo and T. Ogawa, “Self-Remixing: Unsupervised speech separation via separation and remixing,” in Proc. ICASSP, pp. 1–5, 2023.
- C. F. Liao, Y. Tsao, H. Y. Lee, and H. M. Wang, “Noise Adaptive Speech Enhancement Using Domain Adversarial Training,” in Proc. Interspeech, pp. 3148–3152, 2019.
- H. Y. Lin, H. H. Tseng, X. Lu, and Y. Tsao, “Unsupervised noise adaptive speech enhancement by discriminator-constrained optimal transport,” in Proc. Adv. NIPS, 34, pp. 19935–19946, 2021.
- E. Tzinis, Y. Adi, V. K. Ithapu, B. Xu, P. Smaragdis, and A. Kumar, “RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing,” IEEE JSTSP, vol. 16, no. 6, pp. 1329–1341, 2022.
- J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: Learning image restoration without clean data,” in Proc. ICML, pp. 2965–2974, 2018.
- M. M. Kashyap, A. Tambwekar, K. Manohara, and S. Natarajan, “Speech Denoising Without Clean Training Data: A Noise2Noise Approach,” in Proc. Interspeech, pp. 2716–2720, 2021.
- N. Moran, D. Schmidt, Y. Zhong, and P. Coady, “Noisier2Noise: Learning to denoise from unpaired noisy data,” in Proc. CVPR, pp. 12064–12072, 2020.
- T. Pang, H. Zheng, Y. Quan, and H. Ji, “Recorrupted-to-Recorrupted: Unsupervised deep learning for image denoising,” in Proc. CVPR, pp. 2043–2052, 2021.
- T. Fujimura, Y. Koizumi, K. Yatabe, and R. Miyazaki, “Noisy-target training: A training strategy for DNN-based speech enhancement without clean speech,” in Proc. EUSIPCO, pp. 436–440, 2021.
- A. Sivaraman, S. Kim, and M. Kim, “Personalized speech enhancement through self-supervised data augmentation and purification,” in Proc. Interspeech, pp. 2676–2680, 2021.
- T. Fujimura and T. Toda, “Analysis of noisy-target training for DNN-based speech enhancement,” in Proc. ICASSP, pp. 1–5, 2023.
- S. Leglaive, L. Borne, E. Tzinis, M. Sadeghi, M. Fraticelli, S. Wisdom, M. Pariente, D. Pressnitzer, and J. R. Hershey, “The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement,” arXiv preprint arXiv:2307.03533, 2023.
- Website of CHiME-7 Task 2 UDASE: https://www.chimechallenge.org/current/task2/index (last access: Sep. 4, 2023)
- J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020.
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, pp. 5206–5210, 2015.
- G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, “WHAM!: Extending speech separation to noisy environments,” in Proc. Interspeech, pp. 1368–1372, 2019.
- J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Proc. Interspeech, pp. 1561–1565, 2018.
- J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR — half-baked or well done?” in Proc. ICASSP, pp. 626–630, 2019.
- C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP, pp. 886–890, 2022.