Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model (2405.09768v1)
Abstract: Recent advances in generative language modeling applied to discrete speech tokens have opened a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), like their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes suffer from issues such as unintelligibility, the inclusion of non-speech noises, and hallucination. As the adoption of this paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs; it is also rated higher than a conventional TTS system in naturalness and context appropriateness in listening tests. However, the model's intelligibility and speaker consistency lag behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.
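The abstract does not spell out how the automatic metrics are computed. As an illustration only, the sketch below shows two common objective measures for the intelligibility and speaker-consistency dimensions: word error rate from an ASR model run over the synthesized audio, and cosine similarity between speaker embeddings of a reference and a generated utterance. The specific tools (openai-whisper, jiwer, SpeechBrain's ECAPA-TDNN encoder) and the file paths are assumptions, not a statement of the paper's exact setup.

```python
import torch
import torchaudio
import whisper                              # openai-whisper ASR
from jiwer import wer                       # word error rate
from speechbrain.pretrained import EncoderClassifier

# --- Intelligibility: ASR-based word error rate ---
# Transcribe the synthesized waveform and score it against the input text;
# a higher WER suggests less intelligible output.
asr = whisper.load_model("base")

def intelligibility_wer(wav_path: str, reference_text: str) -> float:
    hypothesis = asr.transcribe(wav_path)["text"]
    return wer(reference_text.lower(), hypothesis.lower())

# --- Speaker consistency: speaker-embedding cosine similarity ---
# Embed two utterances with a pretrained speaker encoder and compare them;
# lower similarity between reference and output indicates speaker drift.
spk_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

def speaker_similarity(wav_a: str, wav_b: str) -> float:
    embeddings = []
    for path in (wav_a, wav_b):
        signal, sr = torchaudio.load(path)  # assumes mono audio
        if sr != 16000:                     # encoder expects 16 kHz input
            signal = torchaudio.functional.resample(signal, sr, 16000)
        embeddings.append(spk_encoder.encode_batch(signal).squeeze())
    return torch.nn.functional.cosine_similarity(
        embeddings[0], embeddings[1], dim=0).item()

# Hypothetical usage: "prompt.wav" is the speaker reference,
# "synth.wav" the SLM's synthesized output for the given text.
print(intelligibility_wer("synth.wav", "The birch canoe slid on the smooth planks."))
print(speaker_similarity("prompt.wav", "synth.wav"))
```

In practice such scores are aggregated over a test set, and ASR-based WER is itself noisy (the ASR model makes its own errors), which is why papers in this area typically pair automatic metrics with listening tests.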