Emergent Mind

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

(arXiv:2402.18932)
Published Feb 29, 2024 in eess.AS and cs.SD

Abstract

Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground-truth, and achieve naturalness scores that match the ground-truth in several languages.

Overview

  • The paper presents a novel framework for text-to-speech (TTS) synthesis extending capabilities to over 100 languages, especially benefiting those without transcribed audio data.

  • It utilizes unsupervised learning and a pretrained self-supervised multilingual speech foundation model to generate intelligible speech in languages lacking transcribed speech data.

  • The model combines supervised and unsupervised learning from diverse data types, including untranscribed speech and unspoken text, showing significant improvements in TTS quality across multiple languages.

  • This study marks a significant step towards universalizing TTS technology, potentially transforming global communication by making high-quality speech generation accessible for a wide range of languages.

Extending Multilingual Speech Synthesis to Languages Beyond Transcribed Data

Introduction

The development of text-to-speech (TTS) systems typically favors languages with an abundance of high-quality transcribed audio data. This situation presents a limitation, given the nearly 6,000 languages worldwide, many of which are considered low-resource due to the scarcity of such data. This paper introduces a novel framework that effectively expands TTS capabilities to over 100 languages, significantly increasing language coverage by utilizing unsupervised learning with untranscribed found data. This approach leverages a pretrained self-supervised multilingual speech foundation model for joint speech-text representation learning, demonstrating the ability to generate intelligible speech in languages without previously available transcribed speech data.

Related Work

Past efforts in multilingual TTS have been constrained by the availability of high-quality, paired speech-text data, limiting the scalability and applicability of TTS systems across the wide spectrum of global languages. Although some strategies have sought to alleviate data requirements through unpaired or synthetic training materials, these have often resulted in models with limited language coverage or compromised performance. By incorporating unsupervised learning strategies and leveraging advances in self-supervised speech pretraining and speech-text joint pretraining, this paper positions itself at the forefront of efforts to universalize TTS technology.

Proposed Framework

At the heart of the proposed solution is a joint multilingual speech-text model comprising several components designed to facilitate both supervised and unsupervised learning across languages. The framework employs pretrained speech-to-feature (S2F) and feature-to-speech (F2S) blocks, alongside novel training objectives suited for TTS language expansion. Crucially, the model introduces methods for leveraging found data—comprising speech-text paired data, untranscribed speech data, and unspoken text data—bypassing the need for curated datasets.
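The S2F/F2S decomposition can be pictured as speech and text both mapping into a shared feature space, with F2S decoding those features back to audio at inference time. The sketch below is purely illustrative (the function names, the duration-based upsampling, and the toy transforms are assumptions standing in for large neural networks, not the paper's actual architecture):

```python
# Illustrative stand-ins for the model's components; the real blocks are
# large pretrained neural networks operating on learned representations.

def speech_to_feature(waveform):
    # S2F: pretrained speech encoder mapping audio into the shared feature space
    return [x * 0.5 for x in waveform]  # toy transform

def text_to_feature(tokens, durations):
    # Text path into the same shared feature space, upsampled by
    # per-token predicted durations so it aligns with speech frames
    feats = []
    for tok, dur in zip(tokens, durations):
        feats.extend([float(tok)] * dur)
    return feats

def feature_to_speech(features):
    # F2S: decodes shared features back to a waveform
    return [x * 2.0 for x in features]  # toy inverse transform

def synthesize(tokens, durations):
    # TTS inference uses only the text path: text -> shared features -> speech.
    # The S2F path is used during training to tie speech into the same space.
    return feature_to_speech(text_to_feature(tokens, durations))
```

The point of the shared space is that untranscribed speech (via S2F) and unspoken text (via the text path) can both supervise the same intermediate representation, which is what lets the model learn a language's TTS mapping without paired data.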

Training Objectives

The model is trained on a mixture of transcribed (paired) speech, untranscribed speech, and unspoken text, enabling it to learn from a diverse mix of inputs. Training leverages RNN-T decoder alignments, a feature loss, and duration prediction to optimize performance across languages. A key innovation lies in pseudo-labeling for untranscribed speech and aligned text masked language modeling (MLM) for unspoken text, enabling effective learning even in the absence of transcribed speech data.
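The routing of the three found-data types to their respective objectives can be sketched as follows. This is a simplified illustration under stated assumptions: the dictionary schema, the `recognizer` callable (standing in for the model's own decoder producing pseudo-labels), and the masking scheme are all hypothetical, not the paper's implementation:

```python
import random

MASK = "<mask>"

def make_training_example(item, recognizer=None, mask_prob=0.3, rng=random):
    """Route a found-data item to the objective its type supports.

    item: dict with optional "speech" (feature list) and "text" (token list).
    recognizer: callable producing pseudo-labels for untranscribed speech,
                standing in for the model's own recognition path.
    """
    speech, text = item.get("speech"), item.get("text")
    if speech is not None and text is not None:
        # Paired data: ordinary supervised TTS/ASR objectives
        return {"objective": "supervised", "speech": speech, "text": text}
    if speech is not None:
        # Untranscribed speech: pseudo-label it with the current model
        return {"objective": "pseudo_label", "speech": speech,
                "text": recognizer(speech)}
    # Unspoken text: mask tokens for the aligned text MLM objective
    masked = [MASK if rng.random() < mask_prob else t for t in text]
    return {"objective": "text_mlm", "input": masked, "target": text}
```

In this framing, every found-data item contributes gradient signal through some objective, which is what allows languages with no paired data at all to be trained.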

Curriculum Training Procedures

The model employs a stage-wise training approach, beginning with the pretraining of speech and shared encoders, followed by targeted training of the shared encoder and the RNN-T decoder. The final stage involves joint training that integrates the supervised and unsupervised learning derived from various data types, refining the model's ability to generalize across languages.
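One plausible reading of that stage-wise schedule, expressed as a freeze/unfreeze table, is sketched below. The component names, the per-stage data mix, and the exact unfreezing order are assumptions inferred from the description above, not a specification from the paper:

```python
# Hypothetical curriculum mirroring the stage-wise description above.
CURRICULUM = [
    {"stage": 1, "train": ["speech_encoder", "shared_encoder"],
     "data": ["untranscribed_speech"]},        # encoder pretraining
    {"stage": 2, "train": ["shared_encoder", "rnnt_decoder"],
     "data": ["paired"]},                      # targeted supervised training
    {"stage": 3, "train": ["shared_encoder", "rnnt_decoder", "f2s"],
     "data": ["paired", "untranscribed_speech", "unspoken_text"]},  # joint
]

def trainable_params(model_parts, stage):
    # Mark each named component trainable or frozen for the given stage
    spec = CURRICULUM[stage - 1]
    return {name: (name in spec["train"]) for name in model_parts}
```

The design rationale for such curricula is that later stages inherit stable representations from earlier ones, so the noisy unsupervised objectives in the joint stage refine rather than destabilize the model.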

Experimental Setting

The experimental framework underscores the model's applicability to a broad array of languages, demonstrating significant improvements in TTS quality. By training on diverse datasets spanning 100+ languages, and leveraging both public corpora and proprietary datasets, the study showcases the model's robustness and versatility.

Results

The evaluation reveals promising outcomes, particularly in generating intelligible speech from untranscribed data in more than 30 languages. When minimal supervised data (around 15 minutes of transcribed found data) is incorporated, intelligibility and naturalness scores closely match those of the ground truth in several languages. This achievement illustrates the model's capacity to significantly narrow the gap between high-resource and low-resource languages in TTS applications.
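The intelligibility numbers above are character error rates (CER): synthesized and ground-truth speech are each transcribed, and the reported gap is the difference between their CERs. A minimal CER computation might look like this (function names are illustrative; production evaluations typically use an ASR toolkit's scorer):

```python
def edit_distance(a, b):
    # Levenshtein distance via dynamic programming over two rows
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    # Character error rate: character-level edits / reference length
    return edit_distance(hypothesis, reference) / max(len(reference), 1)
```

A "CER difference of <10% to ground truth" then means `cer(tts_transcript, text) - cer(gt_transcript, text)` stays below 0.10.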

Conclusion

This paper introduces a transformative approach to multilingual TTS development that dramatically increases language coverage without relying on extensively curated datasets. By harnessing unsupervised learning techniques alongside a novel joint speech-text model, the framework facilitates the generation of high-quality speech across a vast array of languages. Looking ahead, the implications for global communication and access to information are profound, offering a pathway to more inclusive and equitable technology deployment worldwide. Future iterations of this work may explore further optimizations and applications, solidifying the foundation laid by this significant step forward in TTS research.
