SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention (2312.08676v2)

Published 14 Dec 2023 in cs.SD, cs.CL, and eess.AS

Abstract: Zero-shot voice conversion (VC) aims to transfer the source speaker timbre to arbitrary unseen target speaker timbre, while keeping the linguistic content unchanged. Although the voice of generated speech can be controlled by providing the speaker embedding of the target speaker, the speaker similarity still lags behind the ground truth recordings. In this paper, we propose SEF-VC, a speaker embedding free voice conversion model, which is designed to learn and incorporate speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism, and then reconstruct waveform from HuBERT semantic tokens in a non-autoregressive manner. The concise design of SEF-VC enhances its training stability and voice conversion performance. Objective and subjective evaluations demonstrate the superiority of SEF-VC to generate high-quality speech with better similarity to target reference than strong zero-shot VC baselines, even for very short reference speeches.
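
The abstract names two design choices: a position-agnostic cross-attention that pulls speaker timbre from a reference utterance, and non-autoregressive waveform reconstruction from HuBERT semantic tokens. The snippet below is a minimal sketch of the cross-attention idea only, under stated assumptions: it is not the authors' implementation, and all module names, dimensions, and the use of PyTorch's built-in multi-head attention are illustrative placeholders. The point it demonstrates is that the reference-side features carry no positional encoding, so timbre is aggregated from the reference regardless of where it occurs or how long the clip is.

```python
# Minimal sketch (assumptions, not the SEF-VC implementation): content
# features derived from HuBERT semantic tokens attend to reference-speech
# features that carry no positional encoding, so speaker timbre is
# aggregated position-agnostically from the reference.
import torch
import torch.nn as nn


class PositionAgnosticCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, content: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # content:   (B, T_content, d_model) -- embedded HuBERT semantic tokens
        # reference: (B, T_ref, d_model)     -- reference features, deliberately
        #                                       left without positional encodings
        timbre, _ = self.attn(query=content, key=reference, value=reference)
        # Residual connection: content enriched with timbre drawn from the reference
        return self.norm(content + timbre)


if __name__ == "__main__":
    B, T_content, T_ref, d = 2, 100, 50, 256
    content = torch.randn(B, T_content, d)
    reference = torch.randn(B, T_ref, d)  # e.g. features of a short reference clip
    out = PositionAgnosticCrossAttention(d_model=d)(content, reference)
    print(out.shape)  # torch.Size([2, 100, 256])
```

Stacking blocks of this kind inside a non-autoregressive decoder that maps the timbre-enriched semantic features to a waveform would mirror the overall pipeline the abstract describes; the decoder and vocoder stages are omitted here.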
