HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling (2403.05989v1)
Abstract: Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they suffer from low pronunciation accuracy, inconsistent speaking style and timbre, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy, and train it on a combination of real and synthetic data, scaling the data size up to 650k hours and yielding a zero-shot TTS model with 0.8B parameters. Specifically, our method uses a predictor to incorporate into the TTS model a latent variable sequence that carries supplementary acoustic information based on refined self-supervised learning (SSL) discrete units. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is used to generate a large number of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity while also ensuring consistency in timbre. Comparative experiments (demo page: https://anonymous.4open.science/w/ham-tts/) demonstrate our model's superiority over VALL-E in pronunciation precision, speaking style maintenance, and timbre continuity.
- The K-Means Algorithm: A Comprehensive Survey And Performance Evaluation. Electronics, 9(8):1295.
- AudioLM: A Language Modeling Approach to Audio Generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2523–2533.
- Language Models Are Few-Shot Learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5. IEEE.
- High Fidelity Neural Audio Compression. arXiv preprint arXiv:2210.13438.
- NICE: Non-linear Independent Components Estimation.
- Generative Adversarial Networks. Advances in Neural Information Processing Systems, 27.
- Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
- Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems, 33:6840–6851.
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
- UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. In Interspeech.
- Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts. arXiv preprint arXiv:2307.07218.
- Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. Advances in Neural Information Processing Systems, 33:8067–8077.
- Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.
- Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
- Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014.
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Advances in Neural Information Processing Systems, 33:17022–17033.
- DiffWave: A Versatile Diffusion Model for Audio Synthesis. In 9th International Conference on Learning Representations, ICLR 2021.
- Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN Based Statistical Parametric Speech Synthesis. In Interspeech.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning (ICML) 2023. JMLR.
- Neural Speech Synthesis with Transformer Network. Proceedings of the AAAI Conference on Artificial Intelligence, pages 6706–6713.
- Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic Gradient Descent with Warm Restarts. In International Conference on Learning Representations.
- XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System. In Interspeech.
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech. In International Conference on Machine Learning, pages 8599–8608. PMLR.
- Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org.
- Language Models Are Unsupervised Multitask Learners. OpenAI blog, 1(8):9.
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In 9th International Conference on Learning Representations, ICLR 2021.
- U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234–241. Springer.
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv preprint arXiv:2304.09116.
- ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering. arXiv preprint arXiv:2401.07333.
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res., 15:1929–1958.
- A Survey on Neural Speech Synthesis. arXiv preprint arXiv:2106.15561.
- Speech Synthesis Based on Hidden Markov Models. Proceedings of the IEEE, 101(5):1234–1252.
- LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.
- WaveNet: A Generative Model for Raw Audio. In The 9th ISCA Speech Synthesis Workshop, page 125.
- Neural Discrete Representation Learning. Advances in Neural Information Processing Systems, 30.
- Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
- Audiobox: Unified Audio Generation with Natural Language Prompts. arXiv preprint arXiv:2312.15821.
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
- HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation. arXiv preprint arXiv:2210.12740.
- XiaoiceSing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network. In Proc. Interspeech 2023, pages 5401–5405.
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT. arXiv preprint arXiv:2310.04673.
- CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–6. IEEE.
- Tacotron: Towards End-to-End Speech Synthesis. In Interspeech.
- ESPnet: End-to-End Speech Processing Toolkit. In Proceedings of Interspeech, pages 2207–2211.
- InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt. arXiv preprint arXiv:2301.13662.
- HiFi-Codec: Group-Residual Vector Quantization for High Fidelity Audio Codec. arXiv preprint arXiv:2305.02765.
- SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507.
- Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization. Advances in Neural Information Processing Systems, 32.
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. arXiv preprint arXiv:2305.11000.