ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

Published 1 Jun 2023 in eess.AS, cs.LG, and cs.SD | (2306.01100v1)

Abstract: This paper presents ALO-VC, a non-parallel low-latency one-shot phonetic posteriorgrams (PPGs) based voice conversion method. ALO-VC enables any-to-any voice conversion using only one utterance from the target speaker, with only 47.5 ms future look-ahead. The proposed hybrid signal processing and machine learning pipeline combines a pre-trained speaker encoder, a pitch predictor to predict the converted speech's prosody, and positional encoding to convey the phoneme's location information. We introduce two system versions: ALO-VC-R, which uses a pre-trained d-vector speaker encoder, and ALO-VC-E, which improves performance using the ECAPA-TDNN speaker encoder. The experimental results demonstrate both ALO-VC-R and ALO-VC-E can achieve comparable performance to non-causal baseline systems on the VCTK dataset and two out-of-domain datasets. Furthermore, both proposed systems can be deployed on a single CPU core with 55 ms latency and 0.78 real-time factor. Our demo is available online.

Summary

The paper introduces ALO-VC, a low-latency, high-quality one-shot voice conversion framework based on phonetic posteriorgrams (PPGs).
The method combines signal processing with machine learning and comes in two variants: ALO-VC-R, which uses a pre-trained d-vector speaker encoder, and ALO-VC-E, which uses the ECAPA-TDNN speaker encoder. A pitch predictor handles the prosody of the converted speech.
Experiments on VCTK and two out-of-domain datasets show performance comparable to non-causal baselines while running in real time on a single CPU core (RTF 0.78, 55 ms latency).

Overview of ALO-VC: Low-Latency One-Shot Voice Conversion

The paper introduces ALO-VC, a low-latency one-shot voice conversion (VC) method that converts between any pair of speakers using a single utterance from the target speaker and only 47.5 ms of future look-ahead. ALO-VC uses phonetic posteriorgrams (PPGs) as its representation of linguistic content. Because PPGs can be extracted from any speech, the system can be trained on non-parallel data and does not need a parallel dataset, which is often hard to obtain. The method is a hybrid of signal processing and machine learning: it combines a pre-trained speaker encoder with a pitch predictor that predicts the prosody of the converted speech.
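To make the PPG representation concrete, the following minimal sketch turns per-frame phoneme scores into a posteriorgram, that is, one probability distribution over phoneme classes per frame. The logits, the number of classes, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores for one frame into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posteriorgram(frame_logits):
    """Build a PPG: one row of phoneme-class posteriors per time frame."""
    return [softmax(frame) for frame in frame_logits]


# Hypothetical scores for 3 frames over 4 phoneme classes (illustration only).
frame_logits = [
    [2.0, 0.1, -1.0, 0.3],
    [0.2, 1.5, 0.1, -0.5],
    [-0.3, 0.0, 0.4, 2.2],
]

ppg = posteriorgram(frame_logits)

# Most likely phoneme class in each frame.
best_class_per_frame = [row.index(max(row)) for row in ppg]
```

Each row of `ppg` describes what is being said in that frame rather than who is saying it, which is why this kind of representation suits non-parallel training.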

System Architecture and Methodology

ALO-VC is built for efficiency: each component is chosen to keep latency and computation low. Two versions of the system were developed. ALO-VC-R uses a pre-trained d-vector speaker encoder, while ALO-VC-E uses the ECAPA-TDNN speaker encoder for improved performance. The framework is composed of the following components:

Acoustic Model: A streamlined Conformer-based network that outputs PPGs causally. It omits costlier layers such as multi-head attention, which reduces latency.
Speaker Encoder: A speaker encoder, pre-trained independently of the rest of the system, captures the target speaker's characteristics from a single utterance.
Conversion Model: Transforms the source speech to match the target speaker's voice. It integrates a pitch predictor, which predicts the prosody of the converted speech, and positional encoding, which conveys each phoneme's location information.
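The link between causality and look-ahead can be illustrated with a one-dimensional convolution whose window is allowed to reach only a fixed number of frames into the future. This is a generic sketch, not the paper's layer configuration: the kernel weights, the signal, and the function name `conv1d_lookahead` are assumptions made for illustration.

```python
def conv1d_lookahead(frames, kernel, lookahead):
    """Convolve so that output t sees at most `lookahead` future frames.

    With a kernel of length K, output t depends on inputs
    t - (K - 1 - lookahead) through t + lookahead. Frames outside the
    signal are treated as zeros. lookahead=0 gives a fully causal filter.
    """
    past = len(kernel) - 1 - lookahead
    if past < 0:
        raise ValueError("lookahead cannot exceed kernel length - 1")
    n = len(frames)
    out = []
    for t in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t - past + j
            if 0 <= idx < n:
                acc += weight * frames[idx]
        out.append(acc)
    return out


# Hypothetical frame-level feature values and kernel (illustration only).
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.25, 0.25, 0.5]

causal_out = conv1d_lookahead(signal, kernel, lookahead=0)
lookahead_out = conv1d_lookahead(signal, kernel, lookahead=1)
```

In a streaming system, the total look-ahead is the number of future frames the whole model needs before it can emit an output, multiplied by the frame hop. Keeping that number small is what a causal design buys.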

For waveform synthesis, ALO-VC uses the LPCNet vocoder, fine-tuned separately for each gender to improve the naturalness and quality of the synthesized speech.

Experimental Insights

Experiments on the VCTK dataset and on two out-of-domain datasets show that ALO-VC performs comparably to, and in some cases better than, baseline systems such as VQMIVC and DiffVC. Both ALO-VC-R and ALO-VC-E run in real time on a single CPU core, with a real-time factor of 0.78 and 55 ms latency, while maintaining naturalness and speaker similarity. ALO-VC-E achieves the higher speaker similarity of the two versions, which is consistent with its use of the ECAPA-TDNN speaker encoder.
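The reported efficiency figures can be read with some simple arithmetic. The sketch below uses the standard definition of real-time factor (processing time divided by audio duration) together with the numbers quoted above. The 10-second clip length and the helper names are hypothetical, and this summary does not break down what the latency beyond the look-ahead consists of.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent computing / duration of the audio processed."""
    return processing_seconds / audio_seconds


def processing_time_for(audio_seconds, rtf):
    """Compute time needed for a clip of the given length at a given RTF."""
    return audio_seconds * rtf


# Figures quoted in the paper's abstract.
RTF = 0.78
TOTAL_LATENCY_MS = 55.0
LOOKAHEAD_MS = 47.5

# A hypothetical 10-second clip takes about 7.8 seconds of CPU time.
compute_seconds = processing_time_for(10.0, RTF)

# An RTF below 1 means the system keeps up with incoming audio.
keeps_up = RTF < 1.0

# Fraction of each second of audio left over as spare compute.
headroom = 1.0 - RTF

# Latency not explained by look-ahead (its composition is not given here).
remaining_latency_ms = TOTAL_LATENCY_MS - LOOKAHEAD_MS
```

Note that RTF and latency measure different things: RTF says whether the CPU can sustain the stream, while latency says how far the output lags behind the input.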

Implications and Future Directions

ALO-VC represents an important advancement for real-time voice conversion applications, such as telecommunication, accessibility technologies for impaired speech, and virtual avatars, where latency and computational efficiency are crucial. The system's successful balance of high-quality conversion and low latency opens pathways for its integration into mobile and IoT devices, which often face computational limitations.

The results also point to open questions about the quality and stability of converted speech, which might be improved with neural vocoders that adapt to the speaker or with real-time learning mechanisms. Future research may explore recent diffusion-based modeling techniques in a causal framework, which could narrow the gap between ALO-VC's low latency and the output quality of current non-causal systems. This work could also stimulate research into context-aware and emotion-preserving VC systems, which could enrich the user experience in interactive environments.
