
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

(arXiv:2402.08093)
Published Feb 12, 2024 in cs.LG, cs.CL, and eess.AS

Abstract

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for **B**ig **A**daptive **S**treamable TTS with **E**mergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of LLMs trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark, and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

Figure: BASE TTS overview. A speech tokenizer learns a discrete representation of speech, which is modeled autoregressively together with text and reference speech, then converted into a waveform.

Overview

  • BASE TTS introduces cutting-edge Text-to-Speech technology with a billion-parameter model trained on 100,000 hours of speech, aiming for human-like speech synthesis.

  • The paper presents three main contributions: the largest TTS model to date, emergent abilities in complex prosody and textual nuances, and novel speech representations through speaker-disentangled speechcodes.

  • Technical innovations include an LLM-based approach built on a Transformer architecture and a novel speechcode tokenization that enables efficient, streamable waveform synthesis.

  • It explores theoretical implications of scaling TTS models and future directions for improving syntactic and emotional expression, while addressing challenges like hallucinations and bias.

Building a Billion-Parameter Text-to-Speech Model: Insights from BASE TTS

Introduction to BASE TTS

BASE TTS marks a new direction in Text-to-Speech (TTS) technology, combining large language model (LLM) techniques with a novel speech tokenization scheme. The study demonstrates a significant leap in speech synthesis by training a billion-parameter model on an unprecedented dataset of 100,000 hours of speech. The model, named Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), brings text-to-speech synthesis closer to natural, human-like performance, particularly in rendering textually complex sentences with appropriate prosody.

Novel Contributions

The main contributions of this work are threefold:

  1. Largest TTS Model: BASE TTS sets a new benchmark in the field by being the largest model to date, with 1 billion parameters. It outperforms existing large-scale TTS models in subjective evaluations, providing more natural speech synthesis.
  2. Emergent Abilities and Benchmark: By scaling the model and dataset size, BASE TTS exhibits emergent abilities, allowing it to effectively render complex prosodic patterns and textual nuances. A specialized dataset and subjective evaluation benchmark for "emergent abilities" in TTS are also introduced, enabling systematic study of model performance against challenging linguistic phenomena.
  3. Novel Speech Representations: The introduction of speaker-disentangled speechcodes, built atop a WavLM self-supervised learning (SSL) model, captures only the essential phonetic and prosodic information, enabling high-quality waveform synthesis even at significant compression rates (a minimal sketch of the quantization idea follows this list).
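
To make the tokenization idea concrete, here is a minimal sketch, assuming PyTorch, of vector-quantizing continuous SSL features (such as WavLM hidden states) into discrete speechcodes. All class names, dimensions, and the codebook size are illustrative assumptions; the paper's actual tokenizer additionally disentangles speaker identity, which is omitted here.

```python
# Hypothetical sketch: quantize continuous SSL features into discrete
# "speechcodes" by nearest-neighbor lookup in a learned codebook.
# Names and sizes are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, dim) continuous SSL representations,
        # e.g., WavLM hidden states extracted from raw audio.
        dists = torch.cdist(features, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)  # discrete speechcodes, shape (batch, time)

# Usage: 50 frames of 256-dim features become 50 integer codes.
vq = VectorQuantizer()
codes = vq(torch.randn(1, 50, 256))
print(codes.shape)  # torch.Size([1, 50])
```

In the paper, an additional speaker-disentanglement objective encourages these codes to drop speaker identity, so the autoregressive model needs to predict only phonetic and prosodic content.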

Technical Overview

BASE TTS approaches the challenge of TTS through an LLM-based paradigm, treating TTS as a next-token-prediction problem. The model architecture comprises a Transformer-based autoregressive model coupled with discrete speech representations termed speechcodes. These speechcodes, derived using a novel tokenization technique, encapsulate speaker ID disentanglement and compression. For the practical application of converting these speechcodes into waveforms, a convolution-based speechcode decoder is employed, markedly enhancing computational efficiency without sacrificing speech quality.
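
The following is a hedged sketch of that two-stage design, assuming PyTorch: a decoder-only Transformer treats text tokens and speechcodes as one next-token-prediction stream, and a small convolutional decoder upsamples predicted codes into waveform samples, chunk by chunk. Layer counts, dimensions, vocabulary size, and the upsampling factor are placeholder assumptions, not the paper's configuration.

```python
# Illustrative two-stage TTS pipeline: Transformer -> speechcodes -> waveform.
# All hyperparameters below are assumptions made for the sketch.
import torch
import torch.nn as nn

class SpeechcodeLM(nn.Module):
    """Autoregressive Transformer: predicts the next speechcode given text."""
    def __init__(self, vocab: int = 2048, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) with text tokens and speechcodes in one stream.
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.transformer(self.embed(tokens), mask=causal)
        return self.head(h)  # logits over the next token at each position

class SpeechcodeDecoder(nn.Module):
    """Convolutional decoder: speechcodes -> waveform, streamable per chunk."""
    def __init__(self, vocab: int = 2048, dim: int = 256, upsample: int = 320):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(dim, dim, kernel_size=upsample, stride=upsample),
            nn.GELU(),
            nn.Conv1d(dim, 1, kernel_size=7, padding=3),
        )

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        x = self.embed(codes).transpose(1, 2)  # (batch, dim, time)
        return self.net(x).squeeze(1)          # (batch, time * upsample) samples

# Usage: decode 20 speechcodes into 6,400 waveform samples.
lm, dec = SpeechcodeLM(), SpeechcodeDecoder()
codes = torch.randint(0, 2048, (1, 20))
logits = lm(codes)   # (1, 20, 2048) next-token distribution
audio = dec(codes)   # (1, 6400)
```

Because the decoder is convolutional rather than autoregressive at the sample level, it can run incrementally over speechcodes as the Transformer emits them, which is what makes the system streamable.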

The dataset used for training BASE TTS, consisting of 100,000 hours of public domain speech data, is significantly more extensive than those used in prior studies, aiding the model in learning from a diverse set of linguistic and prosodic patterns. Notably, BASE TTS employs strategies such as Byte-Pair Encoding (BPE) on speechcodes to optimize sequence length and thus model performance over longer audio sequences.
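
As a toy illustration of BPE over speechcodes (not the paper's implementation), the sketch below repeatedly merges the most frequent adjacent pair of codes into a new token, shortening the sequences the autoregressive Transformer must model:

```python
# Toy byte-pair encoding over integer speechcode sequences.
# Repeated phonetic/prosodic patterns compress into single tokens.
from collections import Counter

def bpe_compress(codes: list, num_merges: int, first_new_id: int):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    merges = {}
    for new_id in range(first_new_id, first_new_id + num_merges):
        pairs = Counter(zip(codes, codes[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges[best] = new_id
        merged, i = [], 0
        while i < len(codes):
            if i + 1 < len(codes) and (codes[i], codes[i + 1]) == best:
                merged.append(new_id)       # replace the pair with one token
                i += 2
            else:
                merged.append(codes[i])
                i += 1
        codes = merged
    return codes, merges

# Usage: a repetitive speechcode stream shrinks from 9 tokens to 4.
codes = [5, 9, 5, 9, 5, 9, 2, 5, 9]
short, merges = bpe_compress(codes, num_merges=2, first_new_id=1000)
print(short)  # [1001, 1000, 2, 1000]
```

Shorter sequences mean the Transformer attends over fewer positions per second of audio, which helps it remain coherent over longer utterances.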

Theoretical Implications and Future Prospects

The implications of this research extend beyond incremental improvements in TTS quality: the work explores the potential emergence of new capabilities as TTS models scale. The phenomenon observed in LLMs, where qualitative leaps in capability occur beyond certain scale thresholds, is hypothesized to apply to large-scale TTS (LTTS) as well. BASE TTS's performance on the emergent-abilities benchmark underscores the impact of model and data scaling on TTS quality and the handling of textually complex inputs.

Future directions highlighted by this work include exploring the scalability of BASE TTS further and integrating text-only LLM knowledge to close the performance gaps in syntactic complexity and emotional expression. Additionally, addressing limitations such as occasional hallucinations or synthesis cutoffs emerging from autoregressive modeling is pivotal. Coupled with ethical considerations around misuse and biases within speech models, these form critical avenues for ongoing research.

Conclusion

BASE TTS's achievements herald a new era in TTS research, promising significantly more natural and expressive synthetic speech. By combining innovative speech tokenization methods with the power of large-scale datasets and models, BASE TTS paves the way for advancements in speech synthesis that could have wide-ranging applications, from enhancing communication aids to creating more immersive interactive systems.
