Conversational End-to-End TTS for Voice Agent (2005.10438v2)

Published 21 May 2020 in cs.SD and eess.AS

Abstract: End-to-end neural TTS has achieved superior performance on reading style speech synthesis. However, it's still a challenge to build a high-quality conversational TTS due to the limitations of the corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech corpus well designed for the voice agent with a new recording scheme ensuring both recording quality and conversational speaking style. Secondly, we propose a conversation context-aware end-to-end TTS approach which has an auxiliary encoder and a conversational context encoder to reinforce the information about the current utterance and its context in a conversation as well. Experimental results show that the proposed methods produce more natural prosody in accordance with the conversational context, with significant preference gains at both utterance-level and conversation-level. Moreover, we find that the model has the ability to express some spontaneous behaviors, like fillers and repeated words, which makes the conversational speaking style more realistic.

Summary

The paper introduces a spontaneous conversational speech corpus and a novel context-aware end-to-end TTS approach for voice agents.
It proposes an architecture utilizing auxiliary and conversational context encoders with BERT embeddings to capture utterance and conversation-level features.
Subjective evaluations show the model significantly improves conversational prosody and preference scores compared to baselines, enabling synthesis of spontaneous speech behaviors.

The paper addresses the challenge of building a high-quality conversational Text-to-Speech (TTS) system for voice agents. It introduces a spontaneous conversational speech corpus and a conversation context-aware end-to-end TTS approach. The approach employs an auxiliary encoder and a conversational context encoder to capture information about the current utterance and its context within a conversation. The paper also finds that the model can express spontaneous behaviors, which makes the conversational speaking style more realistic.

The paper identifies two key problems in building a conversational TTS system: developing a conversational speech corpus and creating a high-performance TTS model that captures prosody in conversations. To address the first problem, the paper introduces a new recording scheme for building spontaneous conversational corpora:

Conversational scenarios and transcripts are designed to ensure content variety and conversational context.
Speakers perform according to the scripts, modifying content and adding spontaneous behaviors.
The recordings are transcribed according to what the speakers actually said, so that the text matches the pronunciation.

The corpus includes the following spontaneous behaviors:

Fillers such as "um", "oh", "aha", "uh"
Repeated words or phrases
False starts
Reduced speech rate or pauses

The paper proposes a conversation context-aware end-to-end TTS approach, which uses an auxiliary encoder and a conversational context encoder.

The end-to-end TTS system is based on Tacotron2. The encoder consists of an embedding layer, three 1-D convolution layers followed by batch normalization and ReLU activations, and a BLSTM layer. Dropout is applied in all convolution and LSTM layers. The decoder is an auto-regressive module with a pre-net and two Zoneout-LSTM layers. The output of the second LSTM layer goes to the attention module, which uses stepwise monotonic attention. The PostNet is a post-filter with five 1-D convolution layers. Parallel WaveNet is adopted as the neural vocoder.

The auxiliary encoder extracts text features using BERT (Bidirectional Encoder Representations from Transformers) embeddings and statistical features representing the syntactic structure:

$F_1$ : the number of characters in the current sentence
$F_2$ : the relative position of the current character in the current sentence
$F_3$ : the number of characters in the current utterance
$F_4$ : the relative position of the current character in the current utterance
$F_5$ : the number of sentences in the current utterance
$F_6$ : the relative position of the current sentence in the current utterance
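
The six statistical features above can be sketched per character as follows. This is only an illustration under stated assumptions: the summary does not say how an utterance is split into sentences or how "relative position" is normalized, so the punctuation-based splitter, the `(index + 1) / length` normalization, and the names `split_sentences` and `statistical_features` are all hypothetical.

```python
import re


def split_sentences(utterance):
    """Split an utterance into sentences, keeping the closing punctuation.

    Assumption: a sentence ends at one of the marks in the character class
    below. The source does not specify the segmentation rule.
    """
    parts = re.findall(r"[^。！？!?.]+[。！？!?.]*", utterance)
    return [p for p in parts if p.strip()]


def statistical_features(utterance):
    """Return one dict of F1..F6 per character of the utterance."""
    sentences = split_sentences(utterance)
    n_chars_utt = sum(len(s) for s in sentences)  # F3
    n_sents = len(sentences)                      # F5
    rows = []
    char_idx_utt = 0
    for sent_idx, sent in enumerate(sentences):
        for char_idx, ch in enumerate(sent):
            char_idx_utt += 1
            rows.append({
                "char": ch,
                "F1": len(sent),                    # characters in sentence
                "F2": (char_idx + 1) / len(sent),   # assumed normalization
                "F3": n_chars_utt,                  # characters in utterance
                "F4": char_idx_utt / n_chars_utt,   # assumed normalization
                "F5": n_sents,                      # sentences in utterance
                "F6": (sent_idx + 1) / n_sents,     # assumed normalization
            })
    return rows


rows = statistical_features("嗯，好的。请稍等！")
print(rows[0])
```

In this sketch every character in the same sentence shares $F_1$ and $F_6$, and every character in the utterance shares $F_3$ and $F_5$; only $F_2$ and $F_4$ vary from character to character.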

The auxiliary encoder uses a pre-net and a CBHG module. The features are up-sampled from character-level to phoneme-level and combined with the encoder outputs using addition.
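
The up-sampling and addition step can be illustrated with plain Python lists. This is a minimal sketch, not the paper's implementation: it assumes up-sampling is done by repeating each character's feature vector once per phoneme (the summary only says "up-sampled"), and the function names and the `phonemes_per_char` alignment are hypothetical.

```python
def upsample_to_phonemes(char_features, phonemes_per_char):
    """Repeat each character-level vector once per phoneme of that character.

    Assumption: repetition is the up-sampling rule; phonemes_per_char would
    come from a text front-end that is not described in the source.
    """
    if len(char_features) != len(phonemes_per_char):
        raise ValueError("one phoneme count is required per character")
    upsampled = []
    for vec, count in zip(char_features, phonemes_per_char):
        upsampled.extend(list(vec) for _ in range(count))
    return upsampled


def add_sequences(a, b):
    """Element-wise sum of two equal-length sequences of equal-size vectors."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]


# Two characters with 2 and 3 phonemes; 2-dimensional features for brevity.
aux_char_level = [[1.0, 0.0], [0.0, 2.0]]
aux_phoneme_level = upsample_to_phonemes(aux_char_level, [2, 3])
encoder_outputs = [[0.5, 0.5] for _ in range(5)]
combined = add_sequences(encoder_outputs, aux_phoneme_level)
print(combined)
```

Addition requires the auxiliary features and the encoder outputs to share the same dimensionality, which is one reason a projection module sits in front of this step.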

The conversational context encoder extracts prosody-related features from sentence embeddings. BERT is used to extract sentence representations, and a one-hot speaker ID vector is appended to each embedding. The conversational context encoder first passes the sequence of sentence embeddings $E_{t-c:t}$ through a linear layer. A GRU (Gated Recurrent Unit) layer then encodes the history $E_{t-c:t-1}$ into a state vector $S_{t}$ . Finally, $S_{t}$ and $E_t$ are concatenated and fed to the linear output layer.
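
One way to write this data flow as equations is given below. The symbols $\tilde{E}_i$ , $W_{\text{in}}$ , $b_{\text{in}}$ , $W_{\text{out}}$ , $b_{\text{out}}$ and $C_t$ are introduced here for illustration and do not come from the paper. The sketch also assumes that $E_i$ already includes the speaker ID vector and that the projected form of $E_t$ is the one concatenated with $S_t$ ; the summary does not settle either point.

$$
\begin{aligned}
\tilde{E}_i &= W_{\text{in}} E_i + b_{\text{in}}, \qquad i = t-c, \dots, t \\
S_t &= \mathrm{GRU}\left(\tilde{E}_{t-c}, \dots, \tilde{E}_{t-1}\right) \\
C_t &= W_{\text{out}} \left[ S_t ; \tilde{E}_t \right] + b_{\text{out}}
\end{aligned}
$$

Here $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, $S_t$ is the final GRU state after reading the $c$ previous sentences, and $C_t$ is the context vector passed on to the TTS model.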

The training corpus consists of 45 conversations between two native Chinese speakers (6 hours total, 3 hours per speaker). The agent speech data, containing about 2,000 utterances (3 hours), is used to train the TTS model. The encoder and decoder are pre-trained with a standard TTS corpus containing 6 hours of Chinese reading-style speech.

Three models are used in the subjective evaluation:

$M_1$ : baseline model
$M_2$ : $M_1$ plus auxiliary encoder
$M_3$ : $M_2$ plus conversational context encoder

For all TTS models, the input phoneme sequence contains phonemes, punctuation marks, and inter-word and inter-syllable symbols. The output is a mel spectrogram extracted from audio at a sample rate of 16,000 Hz. The Adam optimizer is used with $\beta_1=0.9$ , $\beta_2= 0.999$ , and the learning rate decays exponentially from $10^{-3}$ to $10^{-5}$ after 50,000 iterations.
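
The learning-rate schedule can be sketched as a function of the training step. This is a hedged reading rather than the authors' code: it assumes the rate is held at $10^{-3}$ for the first 50,000 iterations and then decays exponentially to a floor of $10^{-5}$ . The length of the decay phase (`decay_steps`) is not given in the summary and is a hypothetical parameter.

```python
def learning_rate(step, lr_start=1e-3, lr_end=1e-5,
                  decay_start=50_000, decay_steps=100_000):
    """Exponential decay from lr_start to lr_end, beginning at decay_start.

    Assumptions: the rate is constant before decay_start, and decay_steps
    (how long the decay takes) is a made-up value for illustration.
    """
    if step <= decay_start:
        return lr_start
    # Fraction of the decay phase completed, clipped so the rate stops at lr_end.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return lr_start * (lr_end / lr_start) ** progress


for step in (0, 50_000, 100_000, 150_000, 300_000):
    print(step, learning_rate(step))
```

Because the decay is exponential, the rate falls by the same factor over equal step intervals, so halfway through the decay phase it sits at the geometric mean of the two endpoints.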

In comparison mean opinion score (CMOS) listening tests with 20 native Chinese speakers, the auxiliary encoder improves on the baseline model with a CMOS of 0.22 and a preference rate of 42.9% at the utterance level, and a CMOS of 0.62 and a preference rate of 59.0% at the conversation level. The conversational context encoder improves prosody expression with a CMOS of 0.18 and a preference rate of 42.1% at the utterance level, and a CMOS of 0.39 and a preference rate of 57.0% at the conversation level. The models can also express spontaneous behaviors such as fillers and repeated words.
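
For readers unfamiliar with these metrics, the sketch below shows how a CMOS and preference rates are commonly computed from paired comparison ratings. It is illustrative only: the integer scale from -3 to 3, the rule that a positive rating counts as a preference for the proposed system, and the ratings themselves are assumptions, not data or procedure from the paper.

```python
def cmos(scores):
    """Mean of the comparison ratings (positive favors the proposed system)."""
    return sum(scores) / len(scores)


def preference_rates(scores):
    """Share of ratings that favor each side, derived from the rating sign."""
    n = len(scores)
    return {
        "prefer_proposed": sum(s > 0 for s in scores) / n,
        "neutral": sum(s == 0 for s in scores) / n,
        "prefer_baseline": sum(s < 0 for s in scores) / n,
    }


# Made-up ratings on an assumed -3..3 scale, for illustration only.
ratings = [2, 1, 0, -1, 1, 0, 3, 0]
print(cmos(ratings))
print(preference_rates(ratings))
```

Under this convention a preference rate below 50% can still accompany a positive CMOS, since neutral ratings lower the preference share without pulling the mean below zero.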
