Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Published 1 Apr 2021 in cs.SD, cs.LG, and eess.AS | (2104.00355v3)

Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: speechbot.github.io/resynthesis.


Authors (8)

Citations (278)


Summary

The paper introduces a novel disentanglement of speech content, prosody, and speaker identity to enable controllable, high-quality resynthesis.
It evaluates SSL methods like CPC, HuBERT, and VQ-VAE, demonstrating HuBERT’s advantage in achieving low-bitrate codecs and improved speaker conversion.
The findings highlight potential for efficient codecs and personalized TTS systems using discrete representations derived from self-supervised learning.

Analyzing Self-Supervised Discrete Representations for Speech Resynthesis

The paper proposes using self-supervised discrete representations for speech resynthesis. The authors disentangle the signal into separate representations for speech content, prosodic features, and speaker identity. This separation allows speech to be synthesized in a controllable manner, and the resulting low-bitrate representations are well suited to building efficient speech codecs.
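To make the idea of disentangled streams concrete, here is a minimal sketch of how the three representations might be held side by side. The `SpeechCode` container, its field names, and the `convert_voice` helper are hypothetical and are not taken from the paper's code; representing the speaker as an integer index into an embedding table is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SpeechCode:
    """Hypothetical container for the three separately extracted streams."""
    content_units: Tuple[int, ...]  # discrete content units, one per frame
    f0_units: Tuple[int, ...]       # discrete prosody (F0) codes
    speaker_id: int                 # assumed index into a speaker embedding table


def convert_voice(code: SpeechCode, target_speaker: int) -> SpeechCode:
    """Return a copy with only the speaker stream changed.

    Content and prosody are left untouched, which is what makes the
    manipulation controllable when the streams are well disentangled.
    """
    return replace(code, speaker_id=target_speaker)


source = SpeechCode(content_units=(3, 3, 17), f0_units=(1, 2, 2), speaker_id=0)
converted = convert_voice(source, target_speaker=5)
```

In this sketch, voice conversion amounts to swapping one field before decoding. How well that works in practice depends on how little speaker information leaks into the content and F0 units.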

Research Methodology and Experiments

The study evaluates self-supervised learning (SSL) methods, focusing on CPC, HuBERT, and VQ-VAE. Each method is used to derive discrete units from speech, which are then assessed for resynthesis quality and codec efficiency. Intelligibility is measured by phoneme error rate (PER) and word error rate (WER), speaker conversion by equal error rate (EER), and overall audio quality by mean opinion score (MOS). Voicing decision error (VDE) and F0 frame error (FFE) measure how faithfully F0 is reconstructed.
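As a rough illustration of three of these metrics, the sketch below computes WER from a word-level edit distance and VDE and FFE from frame-aligned F0 tracks. The function names are hypothetical, unvoiced frames are assumed to be marked with 0.0, and the 20% pitch tolerance is the conventional choice for FFE rather than a value confirmed by this summary.

```python
from typing import Sequence, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def f0_errors(ref_f0: Sequence[float], est_f0: Sequence[float],
              tolerance: float = 0.2) -> Tuple[float, float]:
    """Return (VDE, FFE) for two frame-aligned F0 tracks in Hz.

    A value of 0.0 marks an unvoiced frame (an assumption of this sketch).
    """
    if len(ref_f0) != len(est_f0) or not ref_f0:
        raise ValueError("tracks must be non-empty and equally long")
    voicing_errors = 0
    pitch_errors = 0
    for r, e in zip(ref_f0, est_f0):
        if (r > 0) != (e > 0):
            voicing_errors += 1   # voiced/unvoiced decision disagrees
        elif r > 0 and abs(e - r) > tolerance * r:
            pitch_errors += 1     # both voiced, pitch off by more than tolerance
    n = len(ref_f0)
    return voicing_errors / n, (voicing_errors + pitch_errors) / n


wer = word_error_rate("the cat sat", "the dog sat")
vde, ffe = f0_errors([100.0, 0.0, 200.0, 150.0], [100.0, 120.0, 260.0, 150.0])
```

FFE counts a frame as wrong if either the voicing decision or the pitch value is wrong, so it is never lower than VDE on the same pair of tracks.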

Key Experimental Highlights:

Speech Reconstruction: Both HuBERT and CPC demonstrate superior intelligibility over VQ-VAE. However, VQ-VAE excels in F0 accuracy due to its complete audio reconstruction capability.
Voice Conversion: In speaker conversion and F0 manipulation tasks, the HuBERT and CPC models outperform VQ-VAE, indicating less entanglement of speaker features in discrete units.
Ultra-Lightweight Speech Codec: Used as a codec, the HuBERT-based model reaches a bitrate of 365 bps while providing better perceptual quality than baseline methods such as Codec2, LPCNet, and Opus, which operate at significantly higher bitrates.
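The bitrate of a discrete-unit codec can be estimated from the frame rate and codebook size of each stream. The sketch below assumes fixed-length coding with no entropy coding, so each unit costs log2(codebook size) bits. The frame rates and codebook sizes used are illustrative values only and are not claimed to be the paper's configuration.

```python
import math
from typing import Iterable, Tuple


def stream_bitrate(frame_rate_hz: float, codebook_size: int) -> float:
    """Bits per second for one stream of discrete units (fixed-length coding)."""
    return frame_rate_hz * math.log2(codebook_size)


def total_bitrate(streams: Iterable[Tuple[float, int]]) -> float:
    """Sum the bitrates of several (frame_rate_hz, codebook_size) streams."""
    return sum(stream_bitrate(rate, size) for rate, size in streams)


# Illustrative values: a content stream at 50 frames/s with 100 units,
# plus a slower F0 stream at 6.25 frames/s with 32 codes.
content_bps = stream_bitrate(50, 100)   # about 332.19 bps
f0_bps = stream_bitrate(6.25, 32)       # 31.25 bps
combined_bps = total_bitrate([(50, 100), (6.25, 32)])
```

Under these assumptions the content stream dominates the total, which is why lowering the unit frame rate or the codebook size is the most direct way to reduce the bitrate.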

Implications of the Research

This research presents a unified approach to speech resynthesis and lays groundwork for lightweight speech codecs. By examining what discrete SSL-derived units encode and how they are organized, the authors clarify the potential of these units in low-bitrate applications. This matters for settings that demand high efficiency, such as devices with restricted bandwidth or computing power.

The disentanglement of content, speaker identity, and F0 may also influence future models emphasizing controllable and expressive speech synthesis. The potential for controllable resynthesis across various dimensions holds promise for advancements in personalized TTS systems. Moreover, the successful adaptation of HuBERT as a codec demonstrates how unsupervised models, primarily used for ASR, can effectively transition into synthesis and codec tasks.

Speculations on Future Directions

This paper encourages further work on disentanglement and representation learning. There is considerable room to refine SSL methods so that they extract higher-quality units, which would directly benefit speech synthesis. Future research may explore codec models that better preserve pitch and speaker information at even lower bitrates, as well as adaptive models that adjust dynamically to context or user requirements.

The findings also suggest cross-domain applications of SSL methods, for example in audio-visual or audio-text synchronization, where the information encoded in discrete units could be useful. Such cross-pollination could spur the development of more robust multi-modal systems for IoT and AI-driven environments.

Conclusion

The research demonstrates the strength of self-supervised learning techniques for building controllable, high-quality speech synthesis models. The codec results point to practical as well as academic advances in speech technology and hint at broader applications. As SSL techniques advance, they are likely to play an important role in speech and language processing across various domains.
