
RAVE: A variational autoencoder for fast and high-quality neural audio synthesis (2111.05011v2)

Published 9 Nov 2021 in cs.LG, cs.SD, and eess.AS

Abstract: Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the reconstruction fidelity and the representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.

Citations (95)

Summary

  • The paper presents RAVE, a VAE-based model employing a dual-phase training process—representation learning and adversarial fine-tuning—to produce high-quality audio.
  • It achieves remarkable computational efficiency with synthesis speeds of 985kHz on CPU and 11.7MHz on GPU, outperforming models like NSynth and SING.
  • The framework supports versatile applications such as timbre transfer and audio compression, balancing reconstruction fidelity with a compact latent representation.

RAVE: A Variational Autoencoder for Fast and High-Quality Neural Audio Synthesis

Introduction

The paper presents "RAVE," a novel neural audio synthesis framework leveraging Variational AutoEncoders (VAEs). The primary objective of RAVE is to achieve real-time audio synthesis with high fidelity and efficiency, addressing the challenges associated with raw waveform modeling. Traditional approaches often struggle with computational intensity, limited controllability, or low sampling rates, especially when striving for high-quality audio generation. This research introduces a two-stage training procedure that enables RAVE to overcome these barriers and produce high-quality 48kHz audio, running significantly faster than real-time on standard hardware.

Methodology

The core innovation in RAVE lies in its dual-phase training architecture. The first phase focuses on representation learning using VAEs, while the second phase involves adversarial fine-tuning to refine the audio quality. Initially, the model undergoes training as a regular VAE, where dimensionality reduction and compact representation are emphasized through a spectral distance-based loss.

Figure 1: Reconstruction of an input sample with several fidelity parameters f.

Representation Learning

By employing a multiscale spectral distance that focuses on amplitude spectra, the model circumvents the need to accurately reconstruct phase information, leading to a more perceptually relevant representation. This spectral distance drives the first-stage VAE training, effectively guiding the encoder-decoder architecture to converge on capturing essential audio attributes while minimizing undesired noise.
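The idea behind a multiscale spectral distance can be sketched in a few lines of numpy. The snippet below is a minimal illustration, not RAVE's exact loss: the function names (`stft_amplitude`, `multiscale_spectral_distance`), the FFT sizes, the hop ratio, and the linear-plus-log combination are assumptions chosen to reflect common multiscale spectral losses. Because only amplitude spectra are compared, two signals that differ purely in phase incur no penalty.

```python
import numpy as np

def stft_amplitude(x, n_fft, hop):
    """Amplitude spectrogram via a naive framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_spectral_distance(x, y, scales=(2048, 1024, 512), eps=1e-7):
    """Sum amplitude-spectrogram distances over several FFT sizes.

    Phase is discarded by the magnitude, so the loss focuses on
    perceptually relevant spectral envelope differences.
    """
    total = 0.0
    for n_fft in scales:
        sx = stft_amplitude(x, n_fft, n_fft // 4)
        sy = stft_amplitude(y, n_fft, n_fft // 4)
        # Linear term plus log term: a common multiscale formulation.
        total += np.mean(np.abs(sx - sy))
        total += np.mean(np.abs(np.log(sx + eps) - np.log(sy + eps)))
    return total
```

As a sanity check, the distance between a signal and itself is exactly zero, while signals with different spectral content score a strictly positive distance.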

Adversarial Fine-Tuning

For the final synthesis quality, RAVE integrates a Generative Adversarial Network (GAN) framework in its second training phase. This phase freezes the encoder and solely optimizes the decoder against a discriminator network to enhance the naturalness of generated audio. The adversarial objective is complemented by feature matching losses to stabilize and improve training outcomes, ensuring the synthesized samples are indistinguishable from real audio.
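The second-stage objectives can be sketched as plain functions over discriminator outputs. This is a hedged illustration, not the paper's exact formulation: the hinge form of the adversarial losses and the function names (`feature_matching_loss`, `hinge_d_loss`, `hinge_g_loss`) are assumptions reflecting common GAN practice, and a real implementation would operate on framework tensors inside a training loop with the encoder frozen.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator activations on real and
    generated audio, averaged over layers; stabilizes GAN training."""
    return np.mean([np.mean(np.abs(r - f))
                    for r, f in zip(real_feats, fake_feats)])

def hinge_d_loss(d_real, d_fake):
    """Hinge loss for the discriminator: push real scores above +1
    and fake scores below -1."""
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def hinge_g_loss(d_fake):
    """The decoder (generator) tries to raise the discriminator's
    scores on its own output."""
    return -np.mean(d_fake)
```

The feature matching term gives the decoder a dense, smooth target even when the adversarial gradient is uninformative, which is why it is typically added alongside the hinge objective.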

Results and Discussion

The experimental analysis validates RAVE's effectiveness, demonstrating its superiority over existing models such as NSynth and SING in terms of both quality and computational efficiency. On a qualitative scale, RAVE achieves a mean opinion score (MOS) of 3.01 compared to 2.68 for NSynth and 1.15 for SING. It also achieves this with significantly fewer parameters, indicating a more compact and efficient architecture.

Figure 2: Example of timbre transfer using RAVE.

Synthesis Speed

The synthesis speed of RAVE is another crucial benefit. While autoregressive models like NSynth are hindered by computational bottlenecks, RAVE operates at 985kHz on a CPU and 11.7MHz on a GPU, showcasing its real-time applicability. This is largely attributable to the multiband decomposition, which efficiently handles high sampling rates without elevating the model complexity.
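The rate bookkeeping behind the multiband trick can be shown with a toy decomposition. The sketch below simply deinterleaves the waveform into subband streams; RAVE actually uses a PQMF filter bank, which adds the frequency selectivity this toy version lacks, but the compute argument is the same: with 16 bands, every decoder layer processes signals at 1/16 of the 48kHz output rate.

```python
import numpy as np

def split_bands(x, n_bands=16):
    """Toy critically-sampled decomposition: deinterleave the waveform
    into n_bands streams, each at 1/n_bands of the original rate.
    (A real PQMF adds frequency-selective filtering on top of this.)"""
    n = len(x) - len(x) % n_bands          # trim to a multiple of n_bands
    return x[:n].reshape(-1, n_bands).T    # shape: (n_bands, n // n_bands)

def merge_bands(bands):
    """Inverse of split_bands: re-interleave the subband streams."""
    return bands.T.reshape(-1)
```

One second of 48kHz audio thus becomes 16 streams of 3000 samples each, and the round trip is lossless for this toy scheme.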

Post-Analysis and Applications

Post-training, the latent representation is scrutinized to balance reconstruction fidelity and representation compactness via Singular Value Decomposition (SVD). This approach allows RAVE to dynamically adjust the number of latent dimensions used, based on a fidelity parameter, without detriment to audio quality. Applications discussed include timbre transfer, where RAVE's model adaptation capabilities support cross-domain audio transformations efficiently, and signal compression, wherein the learned latent space facilitates high-ratio compression for downstream tasks.
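The SVD-based post-analysis can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name `informative_dims` is hypothetical, and the sketch assumes the fidelity parameter f is interpreted as the fraction of total latent variance to retain.

```python
import numpy as np

def informative_dims(latents, fidelity=0.95):
    """Rank latent dimensions by an SVD of the centered codes and
    return the smallest count whose singular values explain `fidelity`
    of the total variance, plus the rotation that reorders dimensions.

    `latents` has shape (n_samples, n_latent_dims).
    """
    z = latents - latents.mean(axis=0)
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    var = s ** 2                       # variance captured per component
    cum = np.cumsum(var) / var.sum()   # cumulative explained variance
    k = int(np.searchsorted(cum, fidelity) + 1)
    return k, vt  # project with z @ vt.T and keep the first k coordinates
```

Sweeping the fidelity parameter then trades representation compactness (fewer active dimensions) against reconstruction fidelity, which is the control the paper exposes.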

Conclusion

RAVE represents a significant advancement in the field of neural audio synthesis, balancing quality and efficiency through advanced VAE-GAN architectures and strategic training. By releasing source code and pretrained models, the authors provide a foundation for further research and application in music technology and audio processing domains.

In essence, the proposed framework offers a compelling and practical solution for high-quality audio generation with potential applications in real-time systems, multimedia content creation, and beyond.
