Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks (2407.18571v2)

Published 26 Jul 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Speech bandwidth expansion is crucial for expanding the frequency range of low-bandwidth speech signals, thereby improving audio quality, clarity and perceptibility in digital applications. Its applications span telephony, compression, text-to-speech synthesis, and speech recognition. This paper presents a novel approach using a high-fidelity generative adversarial network. Unlike cascaded systems, our system is trained end-to-end on paired narrowband and wideband speech signals. Our method integrates various bandwidth upsampling ratios into a single unified model specifically designed for speech bandwidth expansion applications. Our approach exhibits robust performance across various bandwidth expansion factors, including those not encountered during training, demonstrating zero-shot capability. To the best of our knowledge, this is the first work to showcase this capability. The experimental results demonstrate that our method outperforms previous end-to-end approaches, as well as interpolation and traditional techniques, showcasing its effectiveness in practical speech enhancement applications.

Summary

The paper introduces an end-to-end GAN approach that transforms narrowband speech to high-quality wideband audio using multi-scale and multi-period discriminators.
It employs a convolutional U-net generator with multi-receptive field fusion, achieving a state-of-the-art Log Spectral Distance of 1.047 at an 8x upsampling ratio.
The unified model demonstrates robust zero-shot performance across various upsampling ratios, simplifying deployment in real-world speech enhancement applications.

Introduction

The paper "Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks" by Mahmoud Salhab and Haidar Harmanani addresses a critical problem in the field of signal processing: the transformation of narrowband speech signals into wideband ones. This process, known as Speech Bandwidth Expansion (BWE), enhances the audio quality, clarity, and perceptibility of speech signals, which is particularly vital for applications like telephony, compression, text-to-speech synthesis, and speech recognition. The proposed solution leverages a high-fidelity generative adversarial network (GAN) to achieve this transformation in an end-to-end manner, which contrasts with traditional cascaded systems that often involve multiple sequential processes.

Methodology

Data Preparation

The approach begins by preparing a dataset $\mathcal{D}$ of paired speech signals sampled at different rates. Each pair consists of a narrowband signal $\hat{x}_m$ and the corresponding wideband signal $x_m$. The goal is to learn a mapping function $\mathcal{F}_{\theta}$, with trainable parameters $\theta$, that upscales $\hat{x}_m$ into a high-fidelity wideband speech signal $\acute{x} \approx x_m$.
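A minimal sketch of how such pairs might be built, assuming the narrowband input is obtained by low-pass filtering the wideband recording and keeping every `ratio`-th sample. The helper names (`moving_average`, `make_pair`), the crude moving-average filter, and the 16 kHz toy tone are illustrative assumptions, not the authors' pipeline, which would normally use a proper anti-aliasing filter.

```python
import math


def moving_average(signal, width):
    """Crude low-pass filter (assumption: stands in for a real anti-aliasing filter)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def make_pair(wideband, ratio):
    """Return (narrowband, wideband), where narrowband keeps every `ratio`-th sample."""
    # Trim so the wideband length is an exact multiple of the ratio.
    usable = len(wideband) - len(wideband) % ratio
    wideband = wideband[:usable]
    filtered = moving_average(wideband, ratio)
    narrowband = filtered[::ratio]
    return narrowband, wideband


# Toy "wideband" signal: a 440 Hz tone sampled at 16 kHz (values are illustrative).
sample_rate = 16000
wide = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1603)]
narrow, wide_trimmed = make_pair(wide, ratio=4)
print(len(narrow), len(wide_trimmed))  # 400 1600
```

With a ratio of 4, the narrowband signal has a quarter of the samples of its wideband target, which is the gap the mapping $\mathcal{F}_{\theta}$ has to close.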

Model Architecture

The authors employ a convolutional model augmented with adversarial training to develop the upscaling function $\mathcal{F}_{\theta}$. The model comprises a generator and two types of discriminators: multi-scale and multi-period discriminators.

Generator: Uses a convolutional U-net-like architecture. This network takes in low-resolution mel-spectrograms and outputs higher resolution versions. It incorporates Multi-Receptive Field Fusion (MRF) to handle different time scales.
Discriminators: The multi-period discriminator captures periodic segments in the speech signal, while the multi-scale discriminator detects long-range dependencies.
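To make the two discriminator views concrete, here is a small sketch of the input transformations as they are commonly described for HiFi-GAN-style models. A multi-period discriminator folds the 1-D waveform into a 2-D grid of width `period`, so that samples one period apart line up in a column. A multi-scale discriminator looks at average-pooled copies of the waveform. The helper names, the zero padding, the non-overlapping pooling, and the period and pooling values are simplifying assumptions for illustration, and the convolutional stacks themselves are omitted.

```python
def fold_by_period(signal, period):
    """Reshape a 1-D signal into rows of length `period` (zero-padded at the end)."""
    remainder = len(signal) % period
    if remainder:
        signal = list(signal) + [0.0] * (period - remainder)
    return [list(signal[i:i + period]) for i in range(0, len(signal), period)]


def average_pool(signal, factor):
    """Downsample by averaging non-overlapping windows of `factor` samples."""
    usable = len(signal) - len(signal) % factor
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


wave = [float(n) for n in range(10)]

grid = fold_by_period(wave, period=3)
# Each column of `grid` holds samples spaced exactly 3 steps apart.
column0 = [row[0] for row in grid]

pooled = average_pool(wave, factor=2)
print(grid)     # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [9.0, 0.0, 0.0]]
print(column0)  # [0.0, 3.0, 6.0, 9.0]
print(pooled)   # [0.5, 2.5, 4.5, 6.5, 8.5]
```

Using several periods and several pooling factors gives the discriminators complementary views: the folded grids expose periodic structure, while the pooled signals emphasize slower, longer-range patterns.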

Training Loss

The training objectives include adversarial loss, mel-spectrogram reconstruction loss, and feature matching loss. These losses ensure not only that the generated signal is indistinguishable from real wideband speech but also that it maintains important spectral characteristics.
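For reference, these three terms are often combined as in HiFi-GAN, using a least-squares adversarial objective. The formulation below is a sketch under that assumption rather than the paper's exact objective. Here $G$ denotes the generator $\mathcal{F}_{\theta}$, $D$ a discriminator with $T$ layers whose $i$-th layer produces features $D^i$ with $N_i$ elements, $\phi$ the mel-spectrogram transform, and $\lambda_{fm}$ and $\lambda_{mel}$ weighting coefficients whose values are not given in this summary. With several discriminators, the adversarial and feature matching terms are summed over all of them.

```latex
\begin{aligned}
\mathcal{L}_{adv}(D; G) &= \mathbb{E}_{(x,\hat{x})}\left[(D(x) - 1)^2 + D(G(\hat{x}))^2\right] \\
\mathcal{L}_{adv}(G; D) &= \mathbb{E}_{\hat{x}}\left[(D(G(\hat{x})) - 1)^2\right] \\
\mathcal{L}_{mel}(G) &= \mathbb{E}_{(x,\hat{x})}\left[\lVert \phi(x) - \phi(G(\hat{x})) \rVert_1\right] \\
\mathcal{L}_{fm}(G; D) &= \mathbb{E}_{(x,\hat{x})}\left[\sum_{i=1}^{T} \frac{1}{N_i} \lVert D^i(x) - D^i(G(\hat{x})) \rVert_1\right] \\
\mathcal{L}_{G} &= \mathcal{L}_{adv}(G; D) + \lambda_{fm}\,\mathcal{L}_{fm}(G; D) + \lambda_{mel}\,\mathcal{L}_{mel}(G)
\end{aligned}
```

The adversarial term pushes generated speech toward being judged real, while the mel-spectrogram and feature matching terms tie the output to the specific wideband target $x$.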

Experimental Setup

The VCTK dataset, which includes multiple speakers and accents, is used for training and evaluation. The models are trained on different upsampling ratios (2, 4, and 8), and a unified model is trained to handle all these ratios simultaneously. Various configurations are tested, including a zero-shot setting where the model generalizes to new upsampling ratios not seen during training.

Results

The results demonstrate that the proposed model consistently outperforms several end-to-end baselines, such as AudioUNet, Temporal FiLM, and AFiLM, across various upsampling ratios. At an upsampling ratio of 8, the proposed model achieves a Log Spectral Distance (LSD, lower is better) of 1.047, which the paper reports as significantly better than previous neural-based methods. Compared with cascaded approaches such as NVSR at lower upsampling ratios, the results are competitive.
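As a rough illustration of the metric, LSD is commonly computed as the per-frame root-mean-square difference between the log power spectra of the reference and the estimate, averaged over frames. The sketch below assumes base-10 logarithms, non-overlapping rectangular frames, and a naive DFT so that it stays dependency-free. Practical implementations typically use a windowed, overlapping STFT, and the paper's exact settings are not given in this summary.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum |S(k)|^2 for k = 0 .. N/2 (fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(s) ** 2)
    return spectrum


def log_spectral_distance(reference, estimate, frame_len=64, eps=1e-10):
    """Mean over frames of the RMS difference between log10 power spectra."""
    n_frames = min(len(reference), len(estimate)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    total = 0.0
    for f in range(n_frames):
        start = f * frame_len
        ref = power_spectrum(reference[start:start + frame_len])
        est = power_spectrum(estimate[start:start + frame_len])
        sq = [(math.log10(r + eps) - math.log10(e + eps)) ** 2 for r, e in zip(ref, est)]
        total += math.sqrt(sum(sq) / len(sq))
    return total / n_frames


# Toy signals: the "estimate" is missing the high-frequency component.
reference = [math.sin(0.3 * n) + 0.5 * math.sin(2.5 * n) for n in range(256)]
estimate = [math.sin(0.3 * n) for n in range(256)]

print(log_spectral_distance(reference, reference))  # identical signals give 0.0
print(log_spectral_distance(reference, estimate) > 0.0)  # True
```

Because the distance is taken on log spectra, missing high-frequency energy is penalized heavily, which is why the metric is a natural fit for bandwidth expansion.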

The unified model also proves effective in zero-shot settings, maintaining robust performance across unseen upsampling ratios, significantly outperforming traditional interpolation methods.
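For context, an interpolation baseline simply fills in samples between the narrowband samples without generating any new high-frequency content. The sketch below uses linear interpolation as a stand-in. This summary does not say which interpolation scheme the paper evaluates, and `upsample_linear` is a hypothetical helper.

```python
def upsample_linear(narrowband, ratio):
    """Insert `ratio - 1` linearly interpolated samples between neighbours."""
    if not narrowband:
        return []
    out = []
    for a, b in zip(narrowband, narrowband[1:]):
        for step in range(ratio):
            out.append(a + (b - a) * step / ratio)
    # Hold the final sample so the output has len(narrowband) * ratio samples.
    out.extend([narrowband[-1]] * ratio)
    return out


narrow = [0.0, 4.0, 8.0]
for ratio in (2, 4):
    print(ratio, upsample_linear(narrow, ratio))
# 2 [0.0, 2.0, 4.0, 6.0, 8.0, 8.0]
# 4 [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 8.0, 8.0, 8.0]
```

The same function works for any integer ratio, but it can only smooth between known samples, whereas a learned model is expected to reconstruct the missing upper band.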

Implications and Future Directions

The proposed method has both practical and theoretical implications. Practically, it simplifies the deployment of speech enhancement systems by using a single unified model capable of handling multiple upsampling ratios. Theoretically, it adds to the body of research demonstrating the efficacy of GANs in generating high-fidelity speech data.

Looking forward, the work opens avenues for deploying these models in real-time applications like low-bandwidth telephony systems, improving audio quality in video conferencing, and enhancing the performance of speech recognition systems trained on wideband data but applied to narrowband signals. Further research could explore the integration of these models in more complex speech processing pipelines, potentially incorporating real-world noise and distortions to make them more robust.

Conclusion

This paper presents a novel, end-to-end approach for speech bandwidth expansion using high-fidelity GANs. Its contributions lie in demonstrating superior performance over existing methods, the capability for zero-shot generalization, and the simplification brought by a unified model capable of handling various upsampling ratios. These findings mark a significant step forward in the field of neural speech enhancement, providing a strong foundation for both future research and practical applications in digital communication and speech technology.
