VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

Published 18 Jun 2021 in eess.AS, cs.CL, cs.MM, cs.SD, and eess.SP | (2106.10132v1)

Abstract: One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations, while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC.




Summary

The paper demonstrates that integrating mutual information minimization with vector quantization effectively disentangles speech representations for one-shot voice conversion.
It employs a multi-loss training strategy with vCLUB for accurate MI estimation, outperforming methods like AutoVC and AdaIN-VC in content preservation and acoustic fidelity.
Improvements in objective metrics (CER/WER) and in subjective mean opinion scores (MOS) underline the practical impact of reduced content leakage and improved synthesis quality.

Overview of VQMIVC: Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

The paper introduces VQMIVC, an approach to one-shot voice conversion (VC) that uses vector quantization (VQ) and mutual information (MI) to achieve unsupervised speech representation disentanglement. The challenge in one-shot VC is to convert a source speaker's utterance so that it matches the target speaker's voice characteristics when only a single utterance from the target speaker is available. VQMIVC addresses this by disentangling speech into content, speaker, and pitch representations, and by using MI as the correlation measure to minimize the dependencies among them.

The authors point out that previous methods for speech representation disentanglement generally ignore the correlation between different speech representations during training, which lets content information leak into the speaker representation and degrades VC performance. VQMIVC mitigates this by minimizing MI between the representations during training, which reduces the dependencies between speech components.

Method and Architecture

The VQMIVC system is built on a framework that decomposes an utterance into content, speaker, and pitch factors through four primary components:

Content Encoder: Uses VQ with contrastive predictive coding (VQCPC) to extract frame-level content representations. Quantizing each frame to a discrete codebook entry acts as a bottleneck that filters out non-linguistic detail.
Speaker Encoder: Generates a speaker representation vector from acoustic features, designed to retain speaker-specific characteristics.
Pitch Extractor: Derives normalized fundamental frequency ($F_0$) as the pitch representation. The normalization is meant to make this representation carry the utterance's intonation variations rather than the speaker's characteristic pitch range.
Decoder: Synthesizes the final output by mapping content, speaker, and pitch representations back into acoustic features.
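
Two of the components above can be illustrated with a minimal sketch. The first function shows the nearest-codeword lookup at the core of vector quantization; the second shows one plausible pitch normalization, assuming per-utterance z-normalization of log-$F_0$ over voiced frames (the summary does not specify the exact scheme). The function names, the convention that unvoiced frames have $F_0 = 0$, and the choice to map them to 0.0 are assumptions made for illustration, not details taken from the authors' code.

```python
import math

def quantize(frame, codebook):
    """Return (index, codeword) of the codebook entry nearest to `frame`
    in squared Euclidean distance."""
    best_idx = min(
        range(len(codebook)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, codebook[k])),
    )
    return best_idx, codebook[best_idx]

def normalize_log_f0(f0_hz):
    """Z-normalize log-F0 over voiced frames (f0 > 0) of one utterance.
    Unvoiced frames (f0 == 0) are mapped to 0.0 (an illustrative choice)."""
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        return [0.0] * len(f0_hz)
    mean = sum(voiced) / len(voiced)
    var = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    std = math.sqrt(var) or 1.0  # guard against a perfectly flat contour
    return [(math.log(f) - mean) / std if f > 0 else 0.0 for f in f0_hz]

# Toy codebook with three 2-dimensional codewords.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
idx, code = quantize([0.9, 0.1], codebook)

# The same contour shape at two pitch ranges normalizes to the same values.
low_voice = normalize_log_f0([100.0, 0.0, 200.0])
high_voice = normalize_log_f0([200.0, 0.0, 400.0])
```

Because the normalization works in the log domain, scaling every $F_0$ value by a constant (a crude stand-in for a higher or lower voice) leaves the normalized contour unchanged, which is the property that keeps speaker pitch range out of the pitch representation in this sketch.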

To optimize disentanglement, the authors train with a multi-loss objective that combines VQCPC, reconstruction, and MI losses. The MI loss is estimated with the variational contrastive log-ratio upper bound (vCLUB); because vCLUB serves as an upper bound on MI, minimizing it reduces the correlations among the content, speaker, and pitch representations.
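
As a rough illustration of the MI term, the sample-based vCLUB estimate for paired samples $(x_i, y_i)$ and a variational approximation $q_\theta(y|x)$ is $\frac{1}{N}\sum_i \big[\log q_\theta(y_i|x_i) - \frac{1}{N}\sum_j \log q_\theta(y_j|x_i)\big]$. The sketch below implements that estimate for a diagonal Gaussian $q_\theta$. The `q_params` callable, the toy data, and the weighted-sum form of `total_loss` (including the `lambda_mi` weight, whose value is not given here) are assumptions for illustration; in practice $q_\theta$ would be a trained network and the estimate would be applied to each pair of representations.

```python
import math

def gaussian_log_prob(y, mean, log_var):
    """Log-density of y under a diagonal Gaussian N(mean, exp(log_var))."""
    return sum(
        -0.5 * (math.log(2 * math.pi) + lv + (yi - m) ** 2 / math.exp(lv))
        for yi, m, lv in zip(y, mean, log_var)
    )

def vclub(xs, ys, q_params):
    """Sample-based vCLUB estimate:
    mean_i [ log q(y_i | x_i) - mean_j log q(y_j | x_i) ].
    `q_params(x)` returns (mean, log_var) of q(y | x)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        mean, log_var = q_params(xs[i])
        positive = gaussian_log_prob(ys[i], mean, log_var)
        negative = sum(gaussian_log_prob(ys[j], mean, log_var) for j in range(n)) / n
        total += positive - negative
    return total / n

def total_loss(l_vqcpc, l_rec, l_mi, lambda_mi):
    """Hypothetical weighted sum of the three loss terms named in the text."""
    return l_vqcpc + l_rec + lambda_mi * l_mi

# Toy data in which y is an exact copy of x (maximally dependent).
xs = [[0.0], [5.0]]
ys = [[0.0], [5.0]]

# A q that uses x to predict y versus a q that ignores x entirely.
dependent_q = lambda x: (x, [0.0] * len(x))
independent_q = lambda x: ([0.0] * len(x), [0.0] * len(x))

mi_dependent = vclub(xs, ys, dependent_q)
mi_independent = vclub(xs, ys, independent_q)
```

When $q_\theta$ can predict $y$ from $x$, the matched pairs score higher than the mismatched ones and the estimate is positive; when $q_\theta$ ignores $x$, the two terms cancel and the estimate is zero. Driving this quantity down during training is what discourages one representation from being predictable from another.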

Analytical Results

The authors report that VQMIVC outperforms existing models such as AutoVC, AdaIN-VC, and VQVC+ in both objective and subjective evaluations. Objective metrics, namely character and word error rates (CER/WER) and the $F_0$ Pearson correlation coefficient, indicate improved content preservation and pitch consistency. Subjective mean opinion scores (MOS) show improvements in perceived speech naturalness and speaker similarity, which the authors attribute to the disentanglement method.
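
For readers unfamiliar with the objective metrics, the following sketch shows how they are commonly computed: CER and WER as edit distance divided by reference length, and pitch consistency as the Pearson correlation between two $F_0$ contours. It is a generic illustration, not the paper's evaluation script; the function names are hypothetical, and the sketch assumes the two contours are already time-aligned and of equal length.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 if tokens match)
            ))
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

word_error = wer("the cat sat", "the cat sit")
char_error = cer("abcd", "abed")
f0_corr = pearson([100.0, 120.0, 140.0], [200.0, 240.0, 280.0])
```

Lower CER/WER on ASR transcripts of converted speech suggests the linguistic content survived conversion, while an $F_0$ correlation close to 1 suggests the converted speech follows the source intonation even when the absolute pitch range differs.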

MI minimization reduces content leakage into the speaker representation, as evidenced by lower estimated MI values and better ASR results.

Implications and Future Directions

VQMIVC's architecture and training approach contribute to one-shot VC chiefly by showing that MI minimization can disentangle interdependent speech components without supervision. This could drive further advances in zero-shot scenarios and improve real-world applications where speaker data is limited.

Looking forward, applying the MI-based disentanglement technique to larger and more diverse datasets would test its capabilities in multilingual and multi-accent voice conversion. Combining the approach with more advanced vocoders may also improve synthesis quality, opening avenues for more natural voice conversion systems.

The study points to promising directions in unsupervised learning for speech applications, suggesting that constraining MI can effectively separate complex speech attributes in a one-shot learning environment.
