Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness (2404.06714v3)

Published 10 Apr 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Recent advancements in NLP have seen large language models (LLMs) excel at producing high-quality text for various purposes. Notably, in Text-To-Speech (TTS) systems, the integration of BERT for semantic token generation has underscored the importance of semantic content in producing coherent speech outputs. Despite this, the use of LLMs to enhance TTS synthesis remains limited. This research introduces an innovative approach, Llama-VITS, which enhances TTS synthesis by enriching the semantic content of text using an LLM. Llama-VITS integrates semantic embeddings from Llama2 with the VITS model, a leading end-to-end TTS framework. Our experiments demonstrate that, by leveraging Llama2 for the primary speech synthesis process, Llama-VITS matches the naturalness of the original VITS (ORI-VITS) and of variants that incorporate BERT (BERT-VITS) on the LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover, our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem dataset, a curated selection of emotionally consistent speech from the EmoV_DB dataset, highlighting its potential to generate emotive speech.
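
The abstract does not specify how the Llama2 embeddings are combined with VITS. Purely as an illustration of what "integrating semantic embeddings" could look like, the minimal sketch below projects one sentence-level vector to a text encoder's hidden size and adds it to every phoneme position. The fusion rule, the function names (`project`, `fuse_global_embedding`), the toy dimensions, and the random stand-in data are all assumptions made for this example, not the paper's confirmed method.

```python
import random

def project(vec, weight):
    """Linear projection; weight holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def fuse_global_embedding(phoneme_states, semantic_vec, weight):
    """Add one projected sentence-level vector to every phoneme position.

    Hypothetical fusion rule chosen for illustration only.
    """
    projected = project(semantic_vec, weight)
    return [[h + p for h, p in zip(state, projected)] for state in phoneme_states]

random.seed(0)
# Toy sizes (assumptions), far smaller than any real model's dimensions.
llm_dim, hidden_dim, num_phonemes = 8, 4, 5

# Random stand-ins for an LLM sentence embedding and text-encoder states.
semantic_vec = [random.gauss(0, 1) for _ in range(llm_dim)]
phoneme_states = [[random.gauss(0, 1) for _ in range(hidden_dim)]
                  for _ in range(num_phonemes)]
# Stand-in for a learned projection matrix of shape (hidden_dim, llm_dim).
weight = [[random.gauss(0, 0.1) for _ in range(llm_dim)]
          for _ in range(hidden_dim)]

fused = fuse_global_embedding(phoneme_states, semantic_vec, weight)
print(len(fused), len(fused[0]))  # sequence length and hidden size are unchanged
```

In this sketch the fused states keep the shape of the phoneme states, so the rest of a synthesis pipeline would not need to change; whether Llama-VITS works this way is not stated in the abstract.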


