A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023 (2310.05203v1)

Published 8 Oct 2023 in eess.AS, cs.CL, cs.LG, cs.SD, and eess.SP

Abstract: This paper presents our systems (denoted as T13) for the Singing Voice Conversion Challenge (SVCC) 2023. For both the in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representations. To achieve data-efficient SVC with a limited amount of target singer/speaker data (150 to 160 utterances for SVCC 2023), we first train a diffusion-based any-to-any voice conversion model on a publicly available large-scale corpus of 750 hours of speech and singing data. We then finetune the model for each target singer/speaker of Task 1 and Task 2. Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity on the harder cross-domain SVC task (Task 2), which suggests the generalization ability of our proposed method. Our objective evaluation results show that using large datasets is particularly beneficial for cross-domain SVC.
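The two-stage recipe described in the abstract (pretrain an any-to-any voice conversion model on a large mixed speech/singing corpus, then finetune a copy per SVCC target on their small dataset) can be sketched as a pair of training configurations. This is a minimal illustrative sketch, not the authors' code: the step counts, checkpoint name, and per-target data size in hours are assumptions; only the 750-hour pretraining corpus and the 150 to 160 target utterances come from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrainConfig:
    corpus_hours: float              # amount of training audio
    steps: int                       # number of optimization steps
    init_from: Optional[str] = None  # checkpoint to initialize from, if any

def pretrain_config() -> TrainConfig:
    # Stage 1: any-to-any VC model trained from scratch on ~750 hours
    # of publicly available speech and singing data (per the abstract).
    return TrainConfig(corpus_hours=750.0, steps=200_000)

def finetune_config(base_ckpt: str) -> TrainConfig:
    # Stage 2: adapt the pretrained model to one target singer/speaker
    # using only 150-160 utterances (hours and steps are illustrative).
    return TrainConfig(corpus_hours=0.25, steps=10_000, init_from=base_ckpt)

stage1 = pretrain_config()
# One finetuning run per SVCC 2023 target singer/speaker, each warm-started
# from the same pretrained checkpoint (name is hypothetical).
stage2 = [finetune_config("vc_pretrained.pt") for _ in range(4)]
```

The key design point is that the expensive diffusion-model training happens once on the large corpus, while each data-scarce target only requires a short warm-started finetuning run.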

