A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023 (arXiv:2310.05203v1)
Abstract: This paper presents our systems (denoted as T13) for the Singing Voice Conversion Challenge (SVCC) 2023. For both the in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representations. To achieve data-efficient SVC with a limited amount of target singer/speaker data (150 to 160 utterances for SVCC 2023), we first train a diffusion-based any-to-any voice conversion model on 750 hours of publicly available speech and singing data. We then fine-tune the model for each target singer/speaker of Task 1 and Task 2. Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity on the harder cross-domain SVC task (Task 2), suggesting the generalization ability of our proposed method. Our objective evaluation results further show that using large datasets is particularly beneficial for cross-domain SVC.
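To make the two-stage recipe concrete, below is a minimal sketch of a DDPM-style training step for a conditional denoiser operating on mel-spectrogram frames, conditioned on SSL content features, log-F0, and a speaker embedding. This is an illustrative assumption of how such a pipeline could look, not the authors' implementation: the names `Denoiser` and `diffusion_loss`, all dimensions, and the toy network are placeholders, and a real system would use a frozen SSL content encoder (e.g., HuBERT/ContentVec) and a neural vocoder for waveform generation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 100                                 # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Toy denoiser conditioned on content features (from a frozen SSL
    model), log-F0, and a speaker embedding; it predicts the noise that
    was added to a mel-spectrogram frame."""
    def __init__(self, content_dim=768, mel_dim=80, n_speakers=1000):
        super().__init__()
        self.spk = nn.Embedding(n_speakers, 64)
        self.net = nn.Sequential(
            nn.Linear(mel_dim + content_dim + 1 + 64 + 1, 512),
            nn.ReLU(),
            nn.Linear(512, mel_dim),
        )

    def forward(self, x_t, content, logf0, spk_id, t):
        cond = torch.cat(
            [x_t, content, logf0, self.spk(spk_id),
             t.float().unsqueeze(-1) / T],
            dim=-1,
        )
        return self.net(cond)

def diffusion_loss(model, mel, content, logf0, spk_id):
    """One DDPM training step: corrupt mel frames at a random step t,
    then regress the injected noise (Ho et al., 2020)."""
    t = torch.randint(0, T, (mel.size(0),))
    a = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(mel)
    x_t = a.sqrt() * mel + (1.0 - a).sqrt() * noise
    return F.mse_loss(model(x_t, content, logf0, spk_id, t), noise)

# Stage 1: pretrain the any-to-any model on the large mixed corpus.
# Stage 2: fine-tune on the target singer's ~150 utterances, typically
# with a lower learning rate. At conversion time, content and F0 come
# from the source singer while spk_id selects the target.
model = Denoiser()
mel = torch.randn(8, 80)           # fake batch of mel frames
content = torch.randn(8, 768)      # fake SSL content features
logf0 = torch.randn(8, 1)          # fake log-F0 values
spk = torch.randint(0, 1000, (8,))
loss = diffusion_loss(model, mel, content, logf0, spk)
loss.backward()
```

The key design point this sketch captures is that the denoiser is speaker-conditioned, so the same pretrained model converts between arbitrary speakers, and fine-tuning only needs to adapt it to one new speaker embedding and voice.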