One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Published 10 Apr 2019 in cs.LG, cs.SD, eess.AS, and stat.ML | (1904.05742v4)

Abstract: Recently, voice conversion (VC) without parallel data has been successfully adapted to multi-target scenario in which a single model is trained to convert the input voice to many different speakers. However, such model suffers from the limitation that it can only convert the voice to the speakers in the training data, which narrows down the applicable scenario of VC. In this paper, we proposed a novel one-shot VC approach which is able to perform VC by only an example utterance from source and target speaker respectively, and the source and target speaker do not even need to be seen during training. This is achieved by disentangling speaker and content representations with instance normalization (IN). Objective and subjective evaluation shows that our model is able to generate the voice similar to target speaker. In addition to the performance measurement, we also demonstrate that this model is able to learn meaningful speaker representations without any supervision.

Citations (229)

Summary

The paper presents a one-shot voice conversion method that separates speaker and content representations without requiring parallel data.
It employs dedicated speaker and content encoders, and uses Adaptive Instance Normalization in the decoder to recombine the two representations.
Quantitative evaluations and t-SNE visualizations validate the model's ability to robustly convert voices and cluster speaker embeddings for unseen speakers.

An Expert Review of "One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization"

Voice conversion (VC) is an area of speech signal processing concerned with changing the speaker characteristics of a speech signal while preserving its linguistic content. This paper presents a one-shot VC approach that needs no parallel data from the source and target speakers. The method works by disentangling speaker and content representations using instance normalization (IN).

Overview and Methodology

The authors address a limitation of earlier multi-target VC models, which can only convert to speakers present in the training data. Their one-shot method converts between speakers unseen during training, using only one example utterance from the source speaker and one from the target speaker. The approach hinges on disentangling speaker identity from linguistic content with a model made of three components: a speaker encoder, a content encoder, and a decoder.

The speaker encoder extracts speaker-specific characteristics, while the content encoder captures linguistic information with speaker influence removed. Instance normalization in the content encoder normalizes away channel-wise statistics that carry speaker information, and Adaptive Instance Normalization (AdaIN) in the decoder reintroduces the target speaker's characteristics by conditioning the normalized features on the speaker representation to synthesize the converted speech. This architecture encourages factorized latent representations, which are the foundation of one-shot voice conversion.
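The normalization step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the (channels, time) layout, the `eps` value, and the hand-picked `gamma` and `beta` vectors are all assumptions made for illustration. In a real model, the scale and shift would be predicted from the speaker encoder's output.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of x, shaped (channels, time), over the time axis."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaptive_instance_norm(x, gamma, beta, eps=1e-5):
    """Instance-normalize x, then apply a per-channel scale (gamma) and shift (beta)."""
    return gamma[:, None] * instance_norm(x, eps) + beta[:, None]

rng = np.random.default_rng(0)

# Hypothetical feature map: 4 channels, 100 frames. The per-channel scale and
# offset stand in for global, speaker-dependent statistics.
scale = np.array([[1.0], [2.0], [0.5], [3.0]])
offset = np.array([[0.0], [5.0], [-2.0], [1.0]])
features = rng.normal(size=(4, 100)) * scale + offset

# After IN, every channel has roughly zero mean and unit variance,
# so the channel-wise statistics no longer identify the "speaker".
normalized = instance_norm(features)

# Hypothetical affine parameters; assumed here, not taken from the paper.
gamma = np.array([1.5, 0.5, 2.0, 1.0])
beta = np.array([0.1, -0.3, 0.0, 2.0])
styled = adaptive_instance_norm(features, gamma, beta)

print(normalized.mean(axis=1).round(3), normalized.std(axis=1).round(3))
print(styled.mean(axis=1).round(3), styled.std(axis=1).round(3))
```

The second print shows that each channel's mean and standard deviation after AdaIN follow `beta` and `gamma` rather than the statistics of the input, which is the sense in which the decoder can impose a different speaker's global characteristics on the same content features.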

Numerical Evaluation

Objective evaluations indicate that the proposed model converts voice characteristics toward the target speaker even when the speakers are unseen during training. The paper uses global variance analysis to show that the spectral distribution of converted speech aligns with that of the target speaker, which serves as a measure of conversion accuracy. The study also includes spectrogram analysis to confirm visually that fundamental frequency components are converted while the phonetic content is preserved.
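As a rough illustration of a global variance comparison, the sketch below computes the per-dimension variance over time for each utterance and averages it per group. It assumes features shaped (frames, dims) and uses synthetic random arrays in place of real spectral features, so it shows only the mechanics of the comparison, not the paper's data, evaluation code, or results. The helper names are hypothetical.

```python
import numpy as np

def global_variance(features):
    """Per-dimension variance over time for features shaped (frames, dims)."""
    return np.var(features, axis=0)

def mean_global_variance(utterances):
    """Average the global variance vectors of several utterances."""
    return np.mean([global_variance(u) for u in utterances], axis=0)

rng = np.random.default_rng(0)

def synthetic_utterances(std, count=5, frames=200, dims=8):
    # Synthetic stand-in for spectral features; NOT real speech data.
    return [rng.normal(scale=std, size=(frames, dims)) for _ in range(count)]

source = synthetic_utterances(std=1.0)
target = synthetic_utterances(std=2.0)
converted = synthetic_utterances(std=1.9)  # assumed to land close to the target

gv_source = mean_global_variance(source)
gv_target = mean_global_variance(target)
gv_converted = mean_global_variance(converted)

# Mean absolute gap between GV vectors: smaller means better aligned.
gap_converted = np.mean(np.abs(gv_converted - gv_target))
gap_source = np.mean(np.abs(gv_source - gv_target))
print(gap_converted < gap_source)
```

With real data, the three groups would be features extracted from source, target, and converted recordings, and the per-dimension vectors could be plotted against the feature index to compare distributions visually.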

The model's ability to produce what the authors term 'meaningful speaker embeddings', without any speaker labels, is supported by t-SNE visualizations. In these plots, embeddings of speech segments from different speakers form separate clusters, indicating that the speaker encoder learns speaker characteristics. Ablation studies quantify the effect of instance normalization, showing that it reduces the traces of speaker identity in the content encoder's output.

Implications and Future Directions

Practically, this approach to one-shot VC is relevant to personalized text-to-speech systems, anonymization technologies, and other settings where speaker identity must be separated from linguistic content without an extensive training dataset.

Theoretically, this work contributes to broader discussions in representation learning, especially on the use of normalization techniques such as IN to facilitate feature disentanglement. It also supports the view that non-adversarial models can learn complex audio transformations, challenging the dominance of GANs and similarly complex frameworks in non-parallel data settings.

Future explorations may focus on enhancing the model's robustness across varied linguistic domains and accent variations or integrating more sophisticated transformation layers for refining speech texture and prosody beyond basic speaker characteristics. Additionally, expanding this framework to other modalities, such as video-to-audio transformations, presents intriguing interdisciplinary opportunities.

In summary, this paper presents a streamlined and effective approach to voice conversion with broad implications, marking a step towards more versatile and accessible voice conversion systems.
