Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

Published 26 Nov 2019 in eess.AS | (1911.11601v1)

Abstract: We investigate a novel cross-lingual multi-speaker text-to-speech synthesis approach for generating high-quality native or accented speech for native/foreign seen/unseen speakers in English and Mandarin. The system consists of three separately trained components: an x-vector speaker encoder, a Tacotron-based synthesizer and a WaveNet vocoder. It is conditioned on 3 kinds of embeddings: (1) speaker embedding, so that the system can be trained with speech from many speakers with little data from each speaker; (2) language embedding with shared phoneme inputs; (3) stress and tone embedding, which improves the naturalness of synthesized speech, especially for a tonal language like Mandarin. By adjusting the various embeddings, MOS results show that our method can generate high-quality natural and intelligible native speech for native/foreign seen/unseen speakers. Intelligibility and naturalness of accented speech are low, as expected. Speaker similarity is good for native speech from native speakers. Interestingly, speaker similarity is also good for accented speech from foreign speakers. We also find that normalizing the speaker embedding x-vectors by L2-norm normalization or whitening improves output quality a lot in many cases, and that the WaveNet performance seems to be language-independent: our WaveNet is trained with Cantonese speech and can be used to generate Mandarin and English speech very well.
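The two x-vector normalizations named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say which whitening variant is used or which data the statistics are estimated on, so the PCA-style whitening below, the function names (`l2_normalize`, `fit_whitening`, `whiten`) and the `eps` values are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each embedding (row) to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def fit_whitening(x, eps=1e-5):
    """Estimate a PCA-whitening transform from a set of embeddings.

    x: array of shape (num_embeddings, dim).
    Returns (mean, transform); assumed to be fitted on training speakers.
    """
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    # Scale each eigenvector (column) by 1 / sqrt(its eigenvalue).
    transform = eigvecs / np.sqrt(eigvals + eps)
    return mean, transform


def whiten(x, mean, transform):
    """Center the embeddings and decorrelate them to unit variance."""
    return (x - mean) @ transform


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in for speaker embeddings: 500 vectors of size 8.
    xvectors = rng.normal(size=(500, 8)) * np.arange(1, 9) + 3.0

    unit = l2_normalize(xvectors)
    mean, transform = fit_whitening(xvectors)
    white = whiten(xvectors, mean, transform)

    print(np.linalg.norm(unit, axis=1)[:3])        # row norms after L2 normalization
    print(np.round(np.cov(white, rowvar=False), 2))  # covariance after whitening
```

In this sketch the whitening statistics would be fitted once and then reused for unseen speakers, which is a design assumption rather than something the abstract states.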

Citations (26)

Authors (2)
