Adversarially Trained End-to-end Korean Singing Voice Synthesis System

Published 6 Aug 2019 in cs.SD and eess.AS | (1908.01919v1)

Abstract: In this paper, we propose an end-to-end Korean singing voice synthesis system from lyrics and a symbolic melody using the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules; a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram into a linear-spectrogram. In the mel-synthesis network, phonetic enhancement masking is applied to generate implicit formant masks solely from the input text, which enables a more accurate phonetic control of singing voice. In addition, we show that two other proposed methods -- local conditioning of text and pitch, and conditional adversarial training -- are crucial for a realistic generation of the human singing voice in the super-resolution process. Finally, both quantitative and qualitative evaluations are conducted, confirming the validity of all proposed methods.

Authors (5)

Citations (77)

Summary

The paper introduces a novel end-to-end architecture that leverages phonetic enhancement masking for improved pronunciation accuracy.
It employs local conditioning and conditional adversarial training in the super-resolution network to generate realistic singing voices.
On a dataset of 60 Korean pop songs, the best-performing configuration achieved an F1-score of 0.846 for agreement between the generated pitch and the input pitch.

The paper "Adversarially Trained End-to-end Korean Singing Voice Synthesis System" proposes a novel architecture for Korean singing voice synthesis (SVS) capable of generating singing voices directly from lyrics and symbolic melodies. The approach introduces three innovative techniques: phonetic enhancement masking, local conditioning on text and pitch in the super-resolution network, and a conditional adversarial training strategy.

Overview of the Proposed System

The proposed framework is structured into two main modules: a mel-synthesis network and a super-resolution network. The mel-synthesis network is responsible for generating a mel-spectrogram from the input text and pitch information, while the super-resolution network upsamples this mel-spectrogram into a linear-spectrogram. This design choice obviates the need for vocoder feature prediction, which often limits synthesis quality in traditional SVS systems.
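As a rough illustration of this two-stage data flow, the sketch below passes a toy mel-spectrogram through a masking step and then widens each frame to a larger number of frequency bins. Every function name here is hypothetical, and the operations are placeholders for learned networks. Applying the text-derived mask by element-wise multiplication is an assumption (the abstract only says that implicit formant masks are generated from the text inside the mel-synthesis network), and plain linear interpolation stands in for the super-resolution network purely to show how the shapes change.

```python
# Toy sketch of the two-stage data flow. All functions are hypothetical
# stand-ins for learned networks; only the shapes and the order of
# operations are meant to be illustrative.

def apply_formant_mask(mel, mask):
    """Element-wise product of a mel-spectrogram and a same-shaped mask.

    Assumption: the text-derived mask is applied multiplicatively.
    """
    if len(mel) != len(mask) or any(len(m) != len(k) for m, k in zip(mel, mask)):
        raise ValueError("mel and mask must have the same shape")
    return [[m * k for m, k in zip(mel_row, mask_row)]
            for mel_row, mask_row in zip(mel, mask)]


def upsample_frequency(frame, n_out):
    """Linearly interpolate one frame from len(frame) bins to n_out bins.

    Placeholder only: the super-resolution stage described in the text is a
    trained network, not an interpolation.
    """
    n_in = len(frame)
    if n_in < 2 or n_out < 2:
        raise ValueError("need at least two bins on each side")
    out = []
    for j in range(n_out):
        pos = j * (n_in - 1) / (n_out - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1.0 - frac) * frame[lo] + frac * frame[hi])
    return out


def synthesize(mel, mask, n_linear_bins):
    """Stage 1 output (masked mel) followed by the stage 2 placeholder."""
    masked = apply_formant_mask(mel, mask)
    return [upsample_frequency(frame, n_linear_bins) for frame in masked]


if __name__ == "__main__":
    mel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # 2 frames x 3 mel bins
    mask = [[1.0, 0.5, 0.0], [0.0, 1.0, 1.0]]   # same shape as mel
    linear = synthesize(mel, mask, n_linear_bins=5)
    print(len(linear), len(linear[0]))           # frames x linear bins
```

The number of time frames is unchanged between the two stages in this sketch; only the frequency axis grows.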

Key Contributions

Phonetic Enhancement Masking: This method generates implicit formant masks from the text input alone, allowing the model to focus specifically on pronunciation features. The empirical results suggest that it yields more accurate phonetic representations.
Local Conditioning and Adversarial Training: By locally conditioning the super-resolution network on text and pitch, and by employing a conditional adversarial training scheme, the system produces more realistic, higher-quality audio. The architecture uses a projection discriminator and R1 regularization to stabilize adversarial training.
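The two stabilization techniques named above can be sketched in a few lines. A projection discriminator, in its general form, adds an inner product between a condition embedding and the discriminator's features to an unconditional output term, and R1 regularization penalizes the squared gradient norm of the discriminator with respect to real inputs. The code below is a minimal sketch under several assumptions: the feature extractor is the identity, the weights and the `gamma` value are arbitrary, and the gradient is approximated by finite differences so that no autodiff library is required. None of these choices are taken from the paper.

```python
def projection_logit(features, cond_embedding, w_out, b_out):
    """Projection-discriminator logit.

    logit = (w_out . features + b_out) + (cond_embedding . features)
    The first term is unconditional; the second ties the score to the
    condition (here a hypothetical embedding of text and pitch).
    """
    uncond = sum(w * f for w, f in zip(w_out, features)) + b_out
    proj = sum(e * f for e, f in zip(cond_embedding, features))
    return uncond + proj


def r1_penalty(disc, real_x, gamma=10.0, eps=1e-5):
    """R1 penalty (gamma / 2) * ||grad_x D(x)||^2 for one real sample.

    The gradient is approximated by central finite differences; real
    training code would use automatic differentiation instead.
    gamma=10.0 is an arbitrary illustrative value.
    """
    sq_norm = 0.0
    for i in range(len(real_x)):
        plus = list(real_x)
        minus = list(real_x)
        plus[i] += eps
        minus[i] -= eps
        grad_i = (disc(plus) - disc(minus)) / (2.0 * eps)
        sq_norm += grad_i ** 2
    return 0.5 * gamma * sq_norm


if __name__ == "__main__":
    # Hypothetical toy discriminator: identity features, fixed weights.
    w_out, b_out = [1.0, -2.0], 0.1
    cond = [0.5, 0.5]

    def disc(x):
        return projection_logit(x, cond, w_out, b_out)

    real = [0.3, 0.7]
    print(disc(real), r1_penalty(disc, real))
```

Because the penalty is computed only on real samples, it discourages the discriminator from having steep gradients around the data, which is the usual motivation for R1.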

Experimental Validation

The authors validate their approach on a newly collected dataset of 60 Korean pop songs, with recordings manually aligned to the song lyrics and MIDI files. The dataset allowed them to evaluate the model through both quantitative and qualitative assessments.

Numerical Results

F1-score: The best-performing model configuration achieved an F1-score of 0.846, indicating that the generated pitch closely matches the conditioned input pitch.
The paper compares the pronunciation accuracy, sound quality, and naturalness of singing voice between different model configurations and finds noticeable improvements when all proposed methods are utilized together.
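The summary does not say how the pitch F1-score is computed. One plausible reading, shown here purely as an assumption, is to compare the pitch of each generated frame against the input melody as a classification problem over note numbers and to average the per-note F1 scores (macro-F1). The function name and the toy note sequences below are hypothetical.

```python
def macro_f1(reference, predicted):
    """Macro-averaged F1 over pitch labels, one label per frame.

    Assumption: pitches are already quantized to discrete note numbers
    (for example MIDI notes) and the two sequences are frame-aligned.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    labels = sorted(set(reference) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    reference = [60, 60, 62, 62, 64]   # input melody, one note per frame
    predicted = [60, 60, 62, 64, 64]   # pitch estimated from generated audio
    print(round(macro_f1(reference, predicted), 4))
```

Other averaging schemes or tolerance windows would give different numbers, so this sketch should not be used to reproduce the reported 0.846.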

Implications and Future Scope

Phonetic enhancement masking and conditional adversarial training refine existing SVS methodology. Potential applications include more natural synthetic vocals in consumer music production and improved speech synthesis. Future work could explore how well these techniques generalize to other languages and to more complex prosodic features, as well as the integration of neural vocoders to further improve audio quality.

In summary, the paper combines phonetic enhancement masking, local conditioning on text and pitch, and conditional adversarial training into an end-to-end Korean singing voice synthesis system, and its quantitative and qualitative evaluations support the contribution of each component.
