Emergent Mind

HiFi-VC: High Quality ASR-Based Voice Conversion

(2203.16937)
Published Mar 31, 2022 in cs.SD , cs.LG , and eess.AS

Abstract

The goal of voice conversion (VC) is to convert input voice to match the target speaker's voice while keeping text and prosody intact. VC is usually used in entertainment and speaking-aid systems, as well as applied for speech data generation and augmentation. The development of any-to-any VC systems, which are capable of generating voices unseen during model training, is of particular interest to both researchers and the industry. Despite recent progress, any-to-any conversion quality is still inferior to natural speech. In this work, we propose a new any-to-any voice conversion pipeline. Our approach uses automated speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model. According to multiple subjective and objective evaluations, our method outperforms modern baselines in terms of voice quality, similarity and consistency.

We're not able to analyze this paper right now due to high demand.

Please check back later (sorry!).

Generate a summary of this paper on our Pro plan:

We ran into a problem analyzing this paper.

Newsletter

Get summaries of trending comp sci papers delivered straight to your inbox:

Unsubscribe anytime.