DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks (2407.15624v1)
Abstract: In this paper, we propose a dual-stage architecture for bandwidth extension (BWE) that increases the effective sampling rate of speech signals from 8 kHz to 48 kHz. Unlike existing end-to-end deep learning models, our proposed method explicitly models BWE with an excitation stage followed by a linear time-varying (LTV) filter stage. The excitation stage broadens the spectrum of the input, while the filtering stage shapes it according to the outputs of an acoustic feature predictor. In this design, an acoustic feature loss term can implicitly encourage the excitation subnetwork to produce white spectra in the upper frequency band to be synthesized. Experimental results demonstrate that the inductive bias introduced by our approach improves BWE results when either the SEANet or the HiFi-GAN generator serves as the exciter, and that our way of conditioning the processing on acoustic feature predictions is more effective than the mechanism used in HiFi-GAN-2. Secondary contributions include extensions of the SEANet model to accommodate local conditioning information, as well as the application of HiFi-GAN-2 to the BWE problem.
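To make the two-stage signal flow concrete, the sketch below implements a toy version of the idea in PyTorch: a small convolutional network stands in for the excitation subnetwork (the paper uses SEANet- or HiFi-GAN-style generators), a linear projection stands in for the acoustic feature predictor that supplies per-frame FIR taps, and the LTV filter is realized as frame-wise FFT convolution with overlap-add. All module sizes, layer choices, the hop length, and the overlap-add scheme are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of an excitation + LTV-filter BWE pipeline (assumed, not the
# paper's exact design): exciter broadens the spectrum, per-frame FIR taps
# predicted from acoustic features shape it.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Exciter(nn.Module):
    """Stand-in for the excitation subnetwork (e.g. a SEANet/HiFi-GAN generator)."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, 15, padding=7),
            nn.Tanh(),
            nn.Conv1d(channels, 1, 15, padding=7),
        )

    def forward(self, x):  # x: (batch, 1, samples)
        return self.net(x)


class FilterPredictor(nn.Module):
    """Maps acoustic feature frames (e.g. mels) to per-frame FIR taps."""

    def __init__(self, n_features: int = 80, n_taps: int = 64):
        super().__init__()
        self.proj = nn.Linear(n_features, n_taps)

    def forward(self, feats):  # feats: (batch, frames, n_features)
        return self.proj(feats)  # (batch, frames, n_taps)


def ltv_fir(excitation, taps, hop: int):
    """Filter each hop-length frame with its own FIR taps, then overlap-add.

    excitation: (batch, 1, samples); taps: (batch, frames, n_taps).
    """
    b, _, n = excitation.shape
    frames, n_taps = taps.shape[1], taps.shape[2]
    # Slice the excitation into hop-length frames (zero-padded at the end).
    x = F.pad(excitation, (0, max(0, frames * hop - n)))
    x = x.unfold(-1, hop, hop)[:, 0, :frames, :]            # (b, frames, hop)
    # Linear convolution of each frame with its taps via the FFT.
    fft_len = hop + n_taps - 1
    y = torch.fft.irfft(
        torch.fft.rfft(x, fft_len) * torch.fft.rfft(taps, fft_len), fft_len
    )                                                        # (b, frames, fft_len)
    # Overlap-add the filter tails back into one signal.
    out = x.new_zeros(b, frames * hop + n_taps - 1)
    for i in range(frames):
        out[:, i * hop : i * hop + fft_len] += y[:, i]
    return out[:, :n].unsqueeze(1)                           # (b, 1, samples)


if __name__ == "__main__":
    hop, frames = 256, 16
    audio = torch.randn(1, 1, hop * frames)   # upsampled narrowband input
    feats = torch.randn(1, frames, 80)        # predicted acoustic features
    wideband = ltv_fir(Exciter()(audio), FilterPredictor()(feats), hop)
    print(wideband.shape)                     # torch.Size([1, 1, 4096])
```

Factoring the model this way makes the division of labor explicit: the exciter only has to generate spectrally flat high-band content, while the LTV filter, driven by the feature predictor, imposes the spectral envelope.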
- J. Abel and T. Fingscheidt, “Artificial bandwidth extension using deep neural networks for spectral envelope estimation,” in Proc. IEEE Int. Work. on Ac. Sig. Enh. (IWAENC), 2016, pp. 1–5.
- K.Y. Park and H.S. Kim, “Narrowband to wideband conversion of speech using GMM based transformation,” in Proc. IEEE Int. Conf. Ac., Speech, Sig. Proc. (ICASSP), 2000, vol. 3, pp. 1843–1846.
- Y. Li et al., “Real-time speech frequency bandwidth extension,” in Proc. IEEE Int. Conf. Ac., Speech, Sig. Proc. (ICASSP), 2021, pp. 691–695.
- P. Jax and P. Vary, “On artificial bandwidth extension of telephone speech,” Sig. Proc., vol. 83, no. 8, pp. 1707–1719, Aug. 2003.
- V. Kuleshov, S.Z. Enam, and S. Ermon, “Audio super resolution using neural networks,” in Proc. Work. at Int. Conf. on Learn. Rep. (ICLR), 2017.
- S. Birnbaum et al., “Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations,” in Proc. Conf. on Neur. Info. Proc. Sys. (NeurIPS), 2019, pp. 10287–10298.
- T.Y. Lim et al., “Time-frequency networks for audio super-resolution,” in Proc. IEEE Int. Conf. Ac., Speech, Sig. Proc. (ICASSP), 2018, pp. 646–650.
- J. Su et al., “Bandwidth extension is all you need,” in Proc. IEEE Int. Conf. Ac., Speech, Sig. Proc. (ICASSP), 2021, pp. 696–700.
- K. Kumar et al., “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Proc. Conf. on Neur. Info. Proc. Sys. (NeurIPS), 2019, vol. 32.
- J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks,” in Proc. INTERSPEECH, 2020.
- A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
- M. Tagliasacchi et al., “SEANet: A multi-modal speech enhancement network,” in Proc. INTERSPEECH, 2020.
- J. Richter et al., “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Trans. on Aud., Speech, and Lang. Proc., vol. 31, pp. 2351–2364, 2023.
- X. Wang et al., “SpeechX: Neural Codec Language Model as a Versatile Speech Transformer,” arXiv:2308.06873, Aug. 2023.
- D. Yang et al., “UniAudio: An audio foundation model toward universal audio generation,” arXiv:2310.00704, Dec. 2023.
- J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable digital signal processing,” in Proc. Int. Conf. on Learn. Rep. (ICLR), 2020.
- J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features,” in Proc. IEEE Work. on App. of Sig. Proc. to Audio and Acoustics (WASPAA), 2021, pp. 166–170.
- C.J. Steinmetz and J.D. Reiss, “Efficient neural networks for real-time analog audio effect modeling,” in Proc. Aud. Eng. Soc. Conv., 2022.
- O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Med. Im. Comp. and Comp. Ass. Int. (MICCAI), 2015, pp. 234–241.
- C. Veaux, J. Yamagishi, and K. MacDonald, CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit, The Centre for Speech Technology Research (CSTR), University of Edinburgh, 0.8.0 edition, 2012.
- C.H. Taal, R.C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. IEEE Int. Conf. Ac., Speech, Sig. Proc. (ICASSP), 2010, pp. 4214–4217.
- Y. Wu et al., “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. IEEE Int. Conf. Ac., Speech, Sig. Proc. (ICASSP), 2023.
- ITU, ITU-R Recommendation BS.1534-3: Method for the subjective assessment of intermediate quality level of audio systems, International Telecommunication Union, Oct. 2015.
- S. Han and J. Lee, “NU-Wave 2: A general neural audio upsampling model for various sampling rates,” in Proc. INTERSPEECH, 2022.