Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis (2405.09171v1)

Published 15 May 2024 in cs.SD and eess.AS

Abstract: Effectively controlling emotion rendering in text-to-speech (TTS) synthesis remains a challenge. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotion at multiple levels of granularity: phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At inference time, the TTS model generates emotional speech while providing quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework for emotion prediction and control.
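
To make the hierarchical ED concrete, below is a minimal sketch (not the authors' implementation) of how per-phoneme emotion-intensity scores can be aggregated to word and utterance levels and returned as a hierarchical representation. The `Phoneme` dataclass, the `score_intensity` stand-in, and averaging as the aggregation rule are all illustrative assumptions; in a real pipeline the per-segment intensities would likely be derived from acoustic features of forced-aligned segments (the paper's reference list points to openSMILE features and relative-attribute ranking) rather than from segment durations.

```python
# A minimal sketch (assumptions noted above) of building a hierarchical
# emotion distribution (ED): phoneme-level intensity scores, aggregated
# to word and utterance levels.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Phoneme:
    symbol: str
    start: float   # seconds, e.g. from a forced aligner
    end: float
    word_idx: int  # index of the word this phoneme belongs to


def score_intensity(phoneme: Phoneme) -> float:
    """Hypothetical per-phoneme emotion-intensity score in [0, 1].

    In practice this would be predicted from acoustic features of the
    audio in [start, end]; here we fake it with segment duration so the
    sketch runs end to end.
    """
    return min(1.0, (phoneme.end - phoneme.start) / 0.3)


def hierarchical_ed(phonemes: list[Phoneme]) -> dict:
    """Collect intensity scores at phoneme, word, and utterance levels."""
    phone_ed = [score_intensity(p) for p in phonemes]

    # Word level: average the scores of the phonemes inside each word.
    n_words = max(p.word_idx for p in phonemes) + 1
    word_ed = [
        mean(s for p, s in zip(phonemes, phone_ed) if p.word_idx == w)
        for w in range(n_words)
    ]

    # Utterance level: a single global intensity.
    utt_ed = mean(phone_ed)

    return {"phoneme": phone_ed, "word": word_ed, "utterance": utt_ed}


if __name__ == "__main__":
    phones = [
        Phoneme("HH", 0.00, 0.08, 0),
        Phoneme("AY", 0.08, 0.30, 0),
        Phoneme("DH", 0.30, 0.36, 1),
        Phoneme("EH", 0.36, 0.55, 1),
        Phoneme("R", 0.55, 0.70, 1),
    ]
    print(hierarchical_ed(phones))
```

Per the abstract, a vector of this shape would be extracted from ground-truth audio during training as the predictor's target; at inference, the predicted scores can be overridden at the phoneme, word, or utterance level to quantitatively steer emotion rendering.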
