Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

Published 17 Jul 2024 in eess.AS, cs.AI, and eess.SP | (2407.12229v2)

Abstract: People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes, express strong emotions, and generate various NVs in zero-shot TTS. See https://aka.ms/emoctrl-tts for demo samples.


Authors (11)

Citations (5)


Summary

The paper presents EmoCtrl-TTS, a novel zero-shot TTS approach that integrates flow-matching, arousal-valence, and laughter embeddings for expressive emotional synthesis.
It is trained on more than 27,000 hours of expressive real-world speech and outperforms baselines such as Voicebox and ELaTE in naturalness and emotion similarity, at the cost of slightly lower intelligibility.
The findings highlight significant potential for enhanced human-computer interactions in areas such as assistive technologies and entertainment by delivering refined emotional control.

Controlling Time-Varying Emotional States in Zero-Shot Text-to-Speech

The paper "Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech" introduces EmoCtrl-TTS, a zero-shot text-to-speech (TTS) system designed to generate speech with rich emotional content and nonverbal vocalizations (NVs) such as laughter and crying. The system conditions generation on arousal and valence values together with laughter embeddings, which gives finer control over how the emotional state changes within an utterance, a capability that most previous TTS systems do not offer.

Methodological Framework

EmoCtrl-TTS builds on a flow-matching-based zero-shot TTS framework by adding emotion and NV embeddings as conditioning inputs. The model is trained on more than 27,000 hours of expressive real-world speech curated through pseudo-labeling, which addresses a limitation of previous models that relied on smaller, staged datasets. Arousal and valence values provide granular control over the emotional content, while laughter embeddings enable the generation of various NVs, not only laughter.
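The conditioning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how the signals are fused, so frame-wise concatenation, the function names arousal_valence_contour and build_conditioning, the linear interpolation between keypoints, and the value ranges used in the example are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical keypoint format: (frame index, arousal, valence).
Keypoint = Tuple[int, float, float]


def arousal_valence_contour(
    keypoints: Sequence[Keypoint], num_frames: int
) -> List[Tuple[float, float]]:
    """Linearly interpolate sparse keypoints into one (arousal, valence) pair per frame."""
    pts = sorted(keypoints)
    if not pts:
        raise ValueError("need at least one keypoint")
    contour: List[Tuple[float, float]] = []
    for f in range(num_frames):
        if f <= pts[0][0]:
            # Before the first keypoint: hold its value.
            contour.append((pts[0][1], pts[0][2]))
        elif f >= pts[-1][0]:
            # After the last keypoint: hold its value.
            contour.append((pts[-1][1], pts[-1][2]))
        else:
            # Find the surrounding pair of keypoints and interpolate.
            for (f0, a0, v0), (f1, a1, v1) in zip(pts, pts[1:]):
                if f0 <= f <= f1:
                    w = (f - f0) / (f1 - f0)
                    contour.append((a0 + w * (a1 - a0), v0 + w * (v1 - v0)))
                    break
    return contour


def build_conditioning(
    text_emb: Sequence[Sequence[float]],
    contour: Sequence[Tuple[float, float]],
    laughter_emb: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate per-frame text, arousal-valence, and laughter features (assumed fusion)."""
    if not (len(text_emb) == len(contour) == len(laughter_emb)):
        raise ValueError("all inputs must have one entry per frame")
    return [
        list(t) + [a, v] + list(l)
        for t, (a, v), l in zip(text_emb, contour, laughter_emb)
    ]


# "Laugh now, cry later": emotion drifts from high to low arousal and valence.
# The numeric scale here is illustrative only.
contour = arousal_valence_contour([(0, 1.0, 1.0), (4, 0.0, -1.0)], num_frames=5)
cond = build_conditioning([[0.1, 0.2]] * 5, contour, [[0.0]] * 5)
```

Each row of cond would then be passed to the TTS model alongside the audio prompt. The point of the sketch is only that the emotion signal is a per-frame sequence rather than a single utterance-level label, which is what makes time-varying control possible.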

Evaluations and Results

The model was evaluated on several test sets: a Japanese-to-English speech-to-speech translation (S2ST) scenario, a set that tests fine-grained emotional transitions, and sets built from real laughter and crying. EmoCtrl-TTS outperformed baselines such as Voicebox and ELaTE on several metrics. Objective metrics such as AutoPCP and Aro-Val SIM indicated that EmoCtrl-TTS better mimics the emotional transitions of the source audio. Subjective evaluations supported these findings, with EmoCtrl-TTS achieving higher scores in naturalness and emotion similarity. However, a moderate increase in word error rate (WER) was noted in certain scenarios, which suggests that intelligibility can still be improved alongside emotion control.
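For readers unfamiliar with the intelligibility metric, WER is the word-level edit distance between a reference transcript and the recognized transcript of the generated speech, divided by the number of reference words, so lower is better. The sketch below shows the standard computation. The function name word_error_rate and the simple lowercase-and-split normalization are assumptions; the paper's exact recognizer and text normalization are not given in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("laugh now cry later", "laugh now try later")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.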

Implications and Future Work

By enabling more expressive and emotionally rich speech synthesis, EmoCtrl-TTS has implications for applications that require nuanced emotional content, such as assistive technologies, entertainment, and human-computer interaction. The methodology highlights the importance of large-scale, real-world datasets and of continuous emotion representations such as the arousal-valence space. Future work could focus on reducing the WER and on exploring other emotional dimensions, such as dominance, for more refined control. Research could also extend toward adaptive mechanisms in which the system adjusts its emotional output based on contextual cues and feedback.

Overall, EmoCtrl-TTS advances zero-shot TTS by showing that synthetic speech can follow time-varying emotional states and include NVs such as laughter and crying.
