SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis (2001.05685v1)

Published 16 Jan 2020 in cs.SD and eess.AS

Abstract: Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available at https://github.com/tianrengao/SqueezeWave.

Summary

The paper introduces SqueezeWave, a family of lightweight vocoders that need 61x to 214x fewer MACs than WaveGlow.
It employs depthwise separable convolutions and reshaped input waveforms to enable efficient real-time synthesis on edge devices.
Empirical evaluations confirm that different SqueezeWave configurations maintain audio quality while running effectively on hardware like Raspberry Pi and MacBook Pro.

SqueezeWave: A Technical Overview

This essay discusses the paper "SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis", which introduces a new family of vocoders designed to enable efficient real-time speech synthesis on edge devices by building on the architecture of WaveGlow. The work is motivated by the limitations of current vocoder models, particularly their computational demands and latency, which make them poorly suited to on-device deployment.

Background and Motivation

In contemporary text-to-speech (TTS) systems, vocoders transform acoustic features such as mel-spectrograms into audible waveforms. Dominant approaches such as WaveNet and WaveGlow deliver high speech quality but remain computationally prohibitive for edge deployment. The auto-regressive nature of many vocoders restricts parallelization, whereas feed-forward models such as WaveGlow, though parallelizable, still exceed the computational capacity of mobile processors. This paper addresses the need for efficient vocoders that can synthesize speech in real time directly on edge devices, thereby enhancing privacy and reducing reliance on cloud resources.

SqueezeWave Architecture

SqueezeWave builds upon the flow-based WaveGlow model but introduces several architectural innovations to drastically reduce computational requirements:

Reshaping Input Waveforms: By reshaping the input audio tensor to have a shorter temporal length and more channels, SqueezeWave reduces the computational complexity inherent to WaveGlow. The reshaping brings the audio's temporal resolution in line with that of the conditioning mel-spectrogram, removing redundant computation.
Depthwise Separable Convolutions: Drawing on principles from efficient image recognition models, SqueezeWave employs depthwise separable convolutions. This significantly reduces the number of multiply-accumulate operations (MACs), cutting computational cost roughly threefold in certain layers.
Additional Optimizations: The paper further refines the network by replacing dilated convolutions with regular ones and merging the processing branches within the WN function, which improves computational efficiency and simplifies the structure.
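The effect of the first two changes can be illustrated with a rough MAC count. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names are hypothetical, and the sizes used (16,384 samples, 256 hidden channels, kernel size 3, groupings of 8 and 128 channels) are illustrative assumptions rather than figures taken from the text above. It also holds the hidden width fixed while the sequence length changes, which is a simplification.

```python
def conv1d_macs(length, c_in, c_out, kernel):
    """MACs for a standard 1D convolution producing `length` output steps."""
    return length * c_in * c_out * kernel


def separable_conv1d_macs(length, c_in, c_out, kernel):
    """MACs for a depthwise conv (one filter per channel) plus a 1x1 pointwise conv."""
    depthwise = length * c_in * kernel
    pointwise = length * c_in * c_out
    return depthwise + pointwise


def reshape_waveform(samples, channels):
    """Group samples into `channels` rows of length len(samples) // channels.

    Row c holds samples c, c + channels, c + 2 * channels, ...
    """
    if len(samples) % channels != 0:
        raise ValueError("number of samples must be divisible by channels")
    return [samples[c::channels] for c in range(channels)]


# Illustrative sizes (assumptions, not values stated in the essay).
NUM_SAMPLES = 16384
HIDDEN_CHANNELS = 256
KERNEL = 3

waveform = list(range(NUM_SAMPLES))
long_layout = reshape_waveform(waveform, 8)     # 8 channels x 2048 steps
short_layout = reshape_waveform(waveform, 128)  # 128 channels x 128 steps

long_len = len(long_layout[0])
short_len = len(short_layout[0])

# Same hidden layer, counted three ways.
standard = conv1d_macs(long_len, HIDDEN_CHANNELS, HIDDEN_CHANNELS, KERNEL)
separable = separable_conv1d_macs(long_len, HIDDEN_CHANNELS, HIDDEN_CHANNELS, KERNEL)
separable_short = separable_conv1d_macs(short_len, HIDDEN_CHANNELS, HIDDEN_CHANNELS, KERNEL)

print(f"separable saving: {standard / separable:.2f}x")                    # 2.97x
print(f"shorter sequence saving: {separable / separable_short:.0f}x")      # 16x
```

Under these assumptions, the separable layer costs about a third of the standard one, matching the "roughly threefold" figure, and the cost of every convolution scales linearly with temporal length, which is why a shorter, wider input tensor pays off across the whole network.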

Empirical Evaluation

SqueezeWave variants, distinguished by their temporal and channel dimensions (e.g., SW-128L, SW-128S), achieve substantial efficiency gains. For instance, SW-128S requires roughly 214x fewer MACs than WaveGlow without significant degradation in audio quality as measured by mean opinion score (MOS). These vocoders are capable of real-time operation on both a MacBook Pro and a Raspberry Pi 3B+, highlighting the practical applicability of SqueezeWave across diverse hardware environments.

Implications and Future Directions

This research makes it considerably more feasible to deploy high-quality TTS directly on edge devices, with benefits for privacy, latency, and independence from cloud infrastructure. As consumer and developer demand for on-device AI grows, future work could explore further architectural optimizations, support for a wider range of hardware, and improvements in audio quality through supplemental processing techniques such as noise cancellation.

In conclusion, SqueezeWave represents a significant step toward embedding advanced TTS capabilities within resource-constrained devices, carving a path for broader adoption and innovation in the field of on-device speech technologies.
