NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis

Published 17 Nov 2022 in cs.SD and eess.AS | (2211.09407v1)

Abstract: Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, most of the voice synthesis models still require a large number of audio data paired with annotated labels (e.g., text transcription and music score) for training. To this end, we propose a unified framework of synthesizing and manipulating voice signals from analysis features, dubbed NANSY++. The backbone network of NANSY++ is trained in a self-supervised manner that does not require any annotations paired with audio. After training the backbone network, we efficiently tackle four voice applications - i.e. voice conversion, text-to-speech, singing voice synthesis, and voice designing - by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high quality synthesis. Audio samples: tinyurl.com/8tnsy3uc.




Summary

- Introduces a unified framework, built on a self-supervised backbone, that covers voice conversion, text-to-speech (TTS), singing voice synthesis, and voice designing.
- Trains the backbone without annotated data, which reduces the dependence on paired labels and speeds up training convergence.
- Adapts the backbone to each task by modeling only the analysis features that task requires, which improves controllability and data efficiency while keeping synthesis quality high.

The paper "NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis" proposes a single framework for several voice synthesis tasks that have so far been developed independently. Most voice synthesis models require large amounts of audio paired with annotated labels, such as text transcriptions or music scores. NANSY++ reduces this requirement by training its backbone network in a self-supervised manner, so the backbone itself needs no annotations paired with audio.

Key Contributions:

- Unified Framework: NANSY++ handles four voice synthesis applications within one framework:
  - Voice Conversion: Transforming one speaker's voice so that it sounds like another's.
  - Text-to-Speech (TTS): Generating speech from text.
  - Singing Voice Synthesis: Generating a singing voice from a musical score.
  - Voice Designing: Creating new, unique voices by manipulating voice characteristics.
- Self-Supervised Backbone: The backbone network is trained in a self-supervised manner and requires no annotations paired with audio. This reduces the dependency on labeled data and lets the same backbone serve all four tasks.
- Task-Specific Adaptation: After the backbone is trained, each task is addressed by modeling only the subset of analysis features that the task requires. Because the backbone is shared, only these task-specific parts need additional training.
- Performance and Efficiency: The authors report controllability over voice features, data efficiency, and fast training convergence, while maintaining high synthesis quality.
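The idea of manipulating voice through analysis features can be illustrated with a small toy sketch. It is not the authors' implementation: the feature names (`content`, `pitch`, `timbre`) and the helper functions are hypothetical, chosen only to show how swapping or editing individual features could yield voice conversion or pitch control before a synthesizer reconstructs audio.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class AnalysisFeatures:
    """Hypothetical container for features an analysis network might emit."""
    content: List[float]  # what is said or sung (assumed feature name)
    pitch: List[float]    # per-frame pitch contour in Hz (assumed feature name)
    timbre: List[float]   # speaker identity embedding (assumed feature name)


def convert_voice(source: AnalysisFeatures, target: AnalysisFeatures) -> AnalysisFeatures:
    """Keep the source content and pitch, but take the timbre from the target."""
    return replace(source, timbre=target.timbre)


def shift_pitch(features: AnalysisFeatures, ratio: float) -> AnalysisFeatures:
    """Scale the pitch contour by a ratio, leaving the other features untouched."""
    return replace(features, pitch=[f * ratio for f in features.pitch])


# Toy values standing in for features extracted from two recordings.
src = AnalysisFeatures(content=[0.1, 0.2], pitch=[220.0, 230.0], timbre=[1.0, 0.0])
tgt = AnalysisFeatures(content=[0.9, 0.8], pitch=[110.0, 120.0], timbre=[0.0, 1.0])

converted = convert_voice(src, tgt)   # source words, target voice
raised = shift_pitch(src, 2.0)        # same voice, one octave higher
```

In a real system the edited features would then be passed to a synthesis network to produce a waveform; that step is omitted here because the summary does not describe it.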

Overall, NANSY++ unifies four voice synthesis applications under a single framework, reducing the need for labeled data and shortening training while keeping synthesis quality high. The authors support these claims with extensive experiments and provide audio samples that demonstrate the system's output.
