
PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion (2403.01494v1)

Published 3 Mar 2024 in eess.AS, cs.SD, and eess.SP

Abstract: In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC), aiming to achieve two major objectives of EVC: high content naturalness and high emotional naturalness, which are crucial for meeting the demands of human perception. To improve the content naturalness of converted audio, we have developed an end-to-end EVC architecture inspired by the high audio quality of VITS. By seamlessly integrating an acoustic converter and vocoder, we effectively address the common issue of mismatch between emotional prosody training and run-time conversion that is prevalent in existing EVC models. To further enhance the emotional naturalness, we introduce an emotion descriptor to model the subtle prosody variations of different speech emotions. Additionally, we propose a prosody predictor, which predicts prosody features from text based on the provided emotion label. Notably, we introduce a prosody alignment loss to establish a connection between latent prosody features from two distinct modalities, ensuring effective training. Experimental results show that the performance of PAVITS is superior to the state-of-the-art EVC methods. Speech samples are available at https://jeremychee4.github.io/pavits4EVC/ .
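The abstract's prosody alignment loss ties together latent prosody features predicted from text and those extracted from audio. The paper does not give its exact form here, so the following is a minimal sketch under the assumption that it is a mean-squared distance between two same-length sequences of latent vectors; the function name, shapes, and distance choice are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a prosody alignment loss: a mean squared error
# between latent prosody features predicted from text and those derived
# from audio. Names and shapes are assumptions for illustration only.

def prosody_alignment_loss(text_latents, audio_latents):
    """MSE between two same-length sequences of equal-dimension latent vectors."""
    assert len(text_latents) == len(audio_latents), "sequences must align frame-wise"
    total = 0.0
    count = 0
    for z_text, z_audio in zip(text_latents, audio_latents):
        for a, b in zip(z_text, z_audio):
            total += (a - b) ** 2
            count += 1
    return total / count

# Example: two 3-frame sequences of 2-dimensional latents.
text_z = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
audio_z = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
print(prosody_alignment_loss(text_z, audio_z))  # identical latents -> 0.0
```

Minimizing such a loss pushes the text-side prosody predictor toward the audio-side latent distribution, which is what allows run-time conversion (where only text and an emotion label are available) to match training-time prosody.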

