- The paper introduces a novel codec tokenizer that compresses audio with a single quantizer and treats speech as a latent language.
- It leverages an expanded VQ space, K-means clustering initialization, and extended attention mechanisms to enhance reconstruction quality.
- Experiments show that, using only 75 tokens per second, WavTokenizer achieves near-human UTMOS scores while supporting robust semantic modeling.
An Expert Review: WavTokenizer – An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
The paper "wavtokenizer: an efficient acoustic discrete codec tokenizer for audio LLMing" introduces WavTokenizer, a novel acoustic codec model positioned to significantly advance audio LLMing. This model aims to improve upon existing state-of-the-art (SOTA) models by achieving higher levels of compression, superior reconstruction quality, and richer semantic content in audio. In this essay, we summarize the key methodologies, experimental results, and implications of this research.
Introduction
The development of large language models (LLMs) has facilitated significant advances in tasks that span multiple modalities, such as speech synthesis and audio generation. Central to these models is the codec tokenizer, which compresses high-dimensional audio signals into sequences of lower-dimensional discrete tokens. WavTokenizer offers several enhancements over existing models: extreme compression rates, state-of-the-art reconstruction quality, and enriched semantic information. This is accomplished through innovations in the design of both the encoder and decoder, a broader Vector Quantization (VQ) space, and improved attention mechanisms.
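To make the compression rate concrete, here is a back-of-the-envelope calculation using the figures reported in the paper (75 tokens per second and a single 2^12-entry codebook); the snippet is just illustrative arithmetic:

```python
import math

# One second of audio becomes 75 discrete tokens; each token indexes a
# 4,096-entry (2^12) codebook, so it carries log2(4096) = 12 bits.
tokens_per_second = 75
codebook_size = 4096
bitrate_bps = tokens_per_second * math.log2(codebook_size)
print(bitrate_bps)  # 900.0 bits/s, i.e. ~0.9 kbps for 24 kHz audio
```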
Key Contributions
Conceptual Contribution:
The paper introduces a novel approach that reduces the stack of quantizer layers used in prior codecs to a single quantizer. This reduction does not compromise the codec structure essential for uniformly modeling speech, music, and general audio. Furthermore, by aligning the speech space with an expanded codebook space, the model suggests the potential of treating speech representation as a unique latent language.
Technical Contribution:
Key technical advancements include the expansion of the VQ space and the integration of K-means clustering initialization and a random awakening strategy to ensure efficient utilization of the larger codebook. Additionally, the extension of contextual modeling windows, combined with attention mechanisms in the decoder, significantly enhances the model's performance. The decoder innovatively leverages an inverse Fourier transform and a multi-scale discriminator to improve audio reconstruction quality.
Experimental Contribution:
Extensive experiments demonstrate WavTokenizer's superior performance over existing models. Using just 75 tokens per second, WavTokenizer achieves state-of-the-art reconstruction quality on the LibriTTS test-clean dataset. The model also preserves rich semantic information, as measured across a range of downstream tasks. Comprehensive ablation studies validate the importance of each component of the model.
Methodology
Encoder Design:
Similar to EnCodec, the encoder of WavTokenizer employs a convolutional architecture that progressively downsamples the input audio signal. The final output is a compact latent feature representation, which is then quantized by a single quantizer layer. Experiments determined that expanding the codebook size from 2^10 (1,024) to 2^12 (4,096) entries significantly improved the model's performance without sacrificing codebook utilization.
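As a rough illustration of this design, the sketch below builds an EnCodec-style strided convolutional encoder; strides of (2, 4, 5, 8) give a total downsampling factor of 320, i.e. 75 latent frames per second at 24 kHz. The channel widths and kernel sizes are illustrative assumptions, not WavTokenizer's published configuration.

```python
import torch
import torch.nn as nn

def make_encoder(strides=(2, 4, 5, 8), base_channels=32, latent_dim=512) -> nn.Sequential:
    # Initial conv keeps the time resolution; each strided stage then divides it.
    layers, in_ch = [nn.Conv1d(1, base_channels, kernel_size=7, padding=3)], base_channels
    for s in strides:
        layers += [nn.ELU(),
                   nn.Conv1d(in_ch, in_ch * 2, kernel_size=2 * s + 1, stride=s, padding=s)]
        in_ch *= 2
    layers += [nn.ELU(), nn.Conv1d(in_ch, latent_dim, kernel_size=3, padding=1)]
    return nn.Sequential(*layers)

enc = make_encoder()
z = enc(torch.randn(1, 1, 24000))  # one second of 24 kHz audio
print(z.shape)                     # (1, 512, 75): 2 * 4 * 5 * 8 = 320x downsampling
```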
Quantization Strategy:
WavTokenizer innovates by broadening the VQ space and employing K-means clustering for initializing the codebooks. A forced activation strategy ensures that even a larger codebook is effectively utilized, supporting the hypothesis that speech can be treated as a unique language when quantized appropriately.
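A minimal sketch of these two ideas, K-means initialization and forced reactivation of unused codewords, is given below. It is an illustrative implementation under assumed tensor shapes, not WavTokenizer's actual training code.

```python
import torch

def kmeans_init(latents: torch.Tensor, codebook_size: int, iters: int = 10) -> torch.Tensor:
    """Initialize the codebook by running k-means over flattened encoder latents (N, D)."""
    centroids = latents[torch.randperm(latents.size(0))[:codebook_size]].clone()
    for _ in range(iters):
        assign = torch.cdist(latents, centroids).argmin(dim=-1)  # nearest centroid per frame
        for k in range(codebook_size):
            members = latents[assign == k]
            if members.numel() > 0:                              # keep empty clusters as-is
                centroids[k] = members.mean(dim=0)
    return centroids

def reactivate_dead_codes(codebook: torch.Tensor, usage: torch.Tensor,
                          latents: torch.Tensor) -> torch.Tensor:
    """Replace codewords with zero usage by random encoder frames, keeping the codebook alive."""
    dead = usage == 0
    if dead.any():
        codebook[dead] = latents[torch.randint(0, latents.size(0), (int(dead.sum()),))]
    return codebook
```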
Decoder Design:
To better handle the reconstruction of audio from quantized tokens, the decoder deviates from traditional mirrored architectures. Instead, it employs an inverse Fourier transform along with a multi-scale discriminator, significantly reducing aliasing artifacts and enhancing perceptual quality. The incorporation of attention mechanisms within the decoder further bolsters the semantic richness and context-awareness of the reconstructed audio.
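The sketch below shows what an inverse-Fourier-transform decoder head of this kind can look like (in the spirit of Vocos-style heads): the network predicts a log-magnitude and phase per frame, and torch.istft synthesizes the waveform. The layer sizes and FFT parameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    def __init__(self, latent_dim: int = 512, n_fft: int = 1280, hop: int = 320):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Predict (n_fft // 2 + 1) log-magnitudes and phases per latent frame.
        self.proj = nn.Linear(latent_dim, (n_fft // 2 + 1) * 2)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        """z: (batch, frames, latent_dim) -> waveform: (batch, samples)."""
        log_mag, phase = self.proj(z).chunk(2, dim=-1)
        spec = torch.exp(log_mag) * torch.exp(1j * phase)  # complex spectrogram
        return torch.istft(spec.transpose(1, 2), n_fft=self.n_fft, hop_length=self.hop,
                           window=torch.hann_window(self.n_fft))

wav = ISTFTHead()(torch.randn(1, 75, 512))
print(wav.shape)  # roughly one second of 24 kHz audio from 75 latent frames
```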
Training and Loss Functions:
Training uses a hinge loss formulation for the adversarial objective, combined with a quantizer (commitment) loss, a mel-spectrogram reconstruction loss, and a feature-matching loss. Together, these losses ensure that the model achieves high-fidelity reconstruction while maintaining a rich semantic representation.
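A condensed sketch of how such a composite objective can be assembled is shown below; the loss weights are illustrative placeholders (the 45.0 mel weight follows HiFi-GAN convention), not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(disc_real: torch.Tensor, disc_fake: torch.Tensor) -> torch.Tensor:
    # Discriminator: push real scores above +1 and fake scores below -1.
    return F.relu(1.0 - disc_real).mean() + F.relu(1.0 + disc_fake).mean()

def generator_objective(mel_fake, mel_real, disc_fake, feats_fake, feats_real,
                        z_e, z_q, w_mel=45.0, w_fm=2.0, w_vq=1.0) -> torch.Tensor:
    mel_loss = F.l1_loss(mel_fake, mel_real)               # mel-spectrogram reconstruction
    adv_loss = -disc_fake.mean()                           # hinge generator term
    fm_loss = sum(F.l1_loss(f, r) for f, r in zip(feats_fake, feats_real))  # feature matching
    vq_loss = F.mse_loss(z_e, z_q.detach())                # commit encoder output to codewords
    return w_mel * mel_loss + adv_loss + w_fm * fm_loss + w_vq * vq_loss
```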
Experimental Results
The experimental evaluation focused on reconstructing audio from the test-clean subset of LibriTTS and comparing the results to multiple baselines such as EnCodec, HiFi-Codec, and DAC. WavTokenizer demonstrated superior performance with significantly lower token requirements (75 tokens/second), achieving near-human-level UTMOS scores.
Further analysis explored the impact of different contextual windows and codebook sizes. The results confirmed that larger codebook sizes considerably enhanced reconstruction quality, while extended contextual windows improved semantic modeling. Additional evaluation in noisy environments (LibriTTS test-other) and out-of-domain scenarios (LJSpeech) reaffirmed the robustness and adaptability of WavTokenizer.
Semantic Representation
For assessing semantic content, WavTokenizer was evaluated on the ARCH benchmark, which includes diverse datasets covering speech, music, and audio domains. The results on downstream classification tasks showcased WavTokenizer's capability to maintain and even enhance semantic richness compared to existing SOTA models, validating its potential for broader applications in generative and multimodal tasks.
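Evaluations of this kind typically freeze the tokenizer and fit a lightweight classifier on pooled representations. Below is a generic sketch of such a probe; the feature tensors, dimensions, and class count are stand-ins, not the ARCH benchmark's actual API.

```python
import torch
import torch.nn as nn

# Stand-in for frozen tokenizer embeddings: (batch, frames, dim).
features = torch.randn(8, 75, 512)
clip_vectors = features.mean(dim=1)          # mean-pool frames into clip-level vectors

probe = nn.Linear(512, 10)                   # 10 hypothetical downstream classes
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

labels = torch.randint(0, 10, (8,))          # stand-in labels
loss = nn.functional.cross_entropy(probe(clip_vectors), labels)
loss.backward()
optimizer.step()                             # only the probe is trained; the codec stays frozen
```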
Conclusion
WavTokenizer presents a significant advance in acoustic discrete codec tokenizers, offering substantial improvements in compression rate, reconstruction quality, and semantic richness. Its innovative approaches to quantization and decoder design set a new benchmark for future research in audio language modeling. Further developments and validations are anticipated in future releases, expanding its applicability and fine-tuning its performance across domains.
Future Directions
The future directions in this line of research might include exploring the potential of WavTokenizer in real-time applications, optimizing the model for low-latency environments, and extending its capabilities to other audio-related tasks such as content-based retrieval and audio-based emotion recognition. Additionally, integrating WavTokenizer into larger multimodal frameworks could open avenues for even more sophisticated generative models and applications.