NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models (2403.03100v3)
Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Since speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate each one individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models that generates natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model that generates the attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility, and achieves quality on par with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
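To make the factorization idea concrete, below is a minimal PyTorch sketch of what a factorized vector quantizer could look like: the encoder latent is projected into four per-attribute subspaces (content, prosody, timbre, acoustic details), each quantized against its own codebook, and the quantized subspaces are merged for decoding. All module names, dimensions, and the straight-through gradient trick are illustrative assumptions, not the paper's actual implementation, which additionally relies on training techniques to enforce disentanglement between the subspaces.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """One codebook: map each frame vector to its nearest code."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):  # z: (batch, frames, dim)
        w = self.codebook.weight                       # (K, dim)
        # Squared Euclidean distance from every frame to every code.
        dists = (z.pow(2).sum(-1, keepdim=True)
                 - 2 * z @ w.t()
                 + w.pow(2).sum(-1))                   # (batch, frames, K)
        indices = dists.argmin(dim=-1)                 # (batch, frames)
        quantized = self.codebook(indices)             # (batch, frames, dim)
        # Straight-through estimator: copy gradients past the argmin.
        return z + (quantized - z).detach(), indices

class FactorizedVQ(nn.Module):
    """Hypothetical FVQ sketch: one projection + codebook per attribute."""
    ATTRIBUTES = ("content", "prosody", "timbre", "details")

    def __init__(self, latent_dim=256, sub_dim=64, codebook_size=1024):
        super().__init__()
        self.proj = nn.ModuleDict(
            {a: nn.Linear(latent_dim, sub_dim) for a in self.ATTRIBUTES})
        self.vq = nn.ModuleDict(
            {a: VectorQuantizer(codebook_size, sub_dim) for a in self.ATTRIBUTES})
        # A waveform decoder (not shown) would consume the merged subspaces.
        self.merge = nn.Linear(sub_dim * len(self.ATTRIBUTES), latent_dim)

    def forward(self, latent):  # latent: (batch, frames, latent_dim)
        parts, codes = [], {}
        for a in self.ATTRIBUTES:
            q, idx = self.vq[a](self.proj[a](latent))
            parts.append(q)
            codes[a] = idx  # discrete tokens for this attribute's subspace
        return self.merge(torch.cat(parts, dim=-1)), codes

# Toy usage: one utterance, 100 frames of a 256-dim encoder output.
fvq = FactorizedVQ()
recon, codes = fvq(torch.randn(1, 100, 256))
print(recon.shape, {a: c.shape for a, c in codes.items()})
```

In this setup, the factorized diffusion model would then generate the discrete token sequence of each subspace conditioned on the text and on the matching attribute prompt (e.g., a prosody prompt for the prosody codes), which is the divide-and-conquer step the abstract describes.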
Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao