Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Published 5 Jun 2022 in eess.AS, cs.CL, and cs.SD | (2206.02147v3)

Abstract: Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional efforts from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations; the S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective. The code is available at https://github.com/Zain-Jiang/Dict-TTS.




Summary

The paper presents Dict-TTS, a novel method for disambiguating polyphones through an S2PA module that maps semantic patterns to dictionary entries.
It improves pronunciation accuracy and prosody modeling over baseline models without annotated phoneme labels, by training end to end with a mel-spectrogram reconstruction loss.
Dict-TTS demonstrates versatile applicability across languages, reducing training complexity and enabling performance gains via ASR pre-training.

Insights into "Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech"

The paper "Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech" addresses polyphone disambiguation in text-to-speech (TTS) systems by building prior dictionary knowledge into a semantic-aware generative model. Its main contribution is Dict-TTS, which improves pronunciation accuracy by using an online dictionary as a pre-existing source of linguistic knowledge. Because no annotated phoneme labels are required, the approach reduces dependence on large annotated datasets and on language experts.

The paper introduces a semantics-to-pronunciation attention (S2PA) module that matches the semantic patterns of the input text sequence against the semantics of dictionary entries and retrieves the corresponding pronunciations, which is how the model disambiguates polyphones. In experiments on three languages, Dict-TTS outperforms several strong baseline models in pronunciation accuracy and also improves prosody modeling.
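The matching step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention, and the function name `s2pa_attend`, the two-dimensional vectors, and the entry layout are all hypothetical. The example uses the Mandarin polyphone 行, which is read "xing2" (to walk) or "hang2" (row, profession) depending on meaning.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def s2pa_attend(query, entries):
    """Attend from one character's contextual semantic vector to its dictionary entries.

    query:   semantic vector of the character in context (hypothetical encoder output)
    entries: list of (gloss_vector, pronunciation_vector, label) tuples, one per
             dictionary sense of the character (hypothetical layout)
    Returns the attention weights, the weighted pronunciation vector, and the
    label of the highest-weighted entry.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, gloss) / scale for gloss, _, _ in entries]
    weights = softmax(scores)
    pron_dim = len(entries[0][1])
    mixed = [
        sum(w * pron[i] for w, (_, pron, _) in zip(weights, entries))
        for i in range(pron_dim)
    ]
    best = max(range(len(entries)), key=lambda i: weights[i])
    return weights, mixed, entries[best][2]


# Toy data (assumed, not from the paper): two senses of the character 行.
entries = [
    ([1.0, 0.0], [1.0, 0.0], "xing2"),  # sense: to walk, to travel
    ([0.0, 1.0], [0.0, 1.0], "hang2"),  # sense: row, line, profession
]
query = [0.9, 0.1]  # context vector that lies closer to the "walk" sense

weights, mixed, label = s2pa_attend(query, entries)
print(label, [round(w, 3) for w in weights])
```

In an end-to-end system the weighted pronunciation vector, rather than the hard label, would be what flows into the rest of the model, since a soft mixture keeps the step differentiable.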

Key Attributes and Results

Key elements of the proposed model include:

Semantic Encoder and S2PA Module: The semantic encoder derives semantic representations from character inputs. The S2PA module then matches these representations against the dictionary semantics to select a pronunciation for each character. Because the input characters and the dictionary entries are compared in the same semantic space, the pronunciation is chosen according to the meaning of the character in context.
End-to-End Training: The S2PA module is trained jointly with the TTS model, using the mel-spectrogram reconstruction loss as the training signal instead of phoneme labels. This reduces training cost and complexity.
Performance Evaluation: Results are reported as phoneme error rates. Dict-TTS is competitive with, and often better than, phoneme-based systems that rely on rule-based or neural grapheme-to-phoneme (G2P) modules. On the Mandarin dataset, for instance, Dict-TTS achieved lower phoneme error rates than the benchmark G2P tools.
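For readers unfamiliar with the metric, the sketch below shows one common way to compute a phoneme error rate: the Levenshtein edit distance between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The paper's exact evaluation protocol is not described in this summary, so this definition, the function names, and the toy pinyin data are assumptions for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by the total number of reference phonemes."""
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference set contains no phonemes")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total


# Toy data (assumed): the word 银行 ("bank") should be read yin2 hang2,
# but the hypothesis misreads the polyphone 行 as xing2.
refs = [["y", "in2", "h", "ang2"]]
hyps = [["y", "in2", "x", "ing2"]]

per = phoneme_error_rate(refs, hyps)
print(per)  # 0.5: two substitutions out of four reference phonemes
```

A single misread polyphone changes two of the four phonemes here, which is why polyphone errors can weigh heavily on this metric for Mandarin.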

Contributions and Implications

The contributions of Dict-TTS are notable for several reasons:

Integration with Pre-existing Knowledge: By tapping into existing dictionary resources, Dict-TTS reduces reliance on explicit labels while enhancing the pronunciation and prosody of TTS outputs.
Generalization Capacity: The method’s design enables compatibility with diverse languages and dialects, serving as a versatile solution across TTS applications globally, especially for under-resourced languages or dialects lacking comprehensive annotated corpora.
Pre-training on ASR Data: The model can be pre-trained on automatic speech recognition (ASR) datasets, which exposes it to large-scale data and offers a route to further accuracy gains through better semantic comprehension.

Theoretical Impact and Future Directions

The ideas behind Dict-TTS may extend to tasks beyond TTS, such as sequence labeling and language modeling. More broadly, the work encourages revisiting external semantic resources, such as dictionaries, as a way to improve machine learning models on fine-grained NLP tasks.

Future research could extend Dict-TTS to incorporate syntactic information, which might further refine prosody and pronunciation. Richer dictionary datasets are another plausible path toward more expressive and natural synthetic speech.

In conclusion, Dict-TTS shows that existing linguistic resources such as dictionaries can be used efficiently in text-to-speech systems, with practical implications for real-world speech applications and room for further improvement in future work.
