- The paper introduces the Masked Channel Residual Vector Quantization (MCRVQ) mechanism to distribute audio information more evenly across codec channels.
- Experimental results on LibriTTS and LJSpeech demonstrate improved UTMOS, PESQ, and STOI metrics compared to baseline models.
- The framework enhances downstream TTS applications, significantly improving speaker similarity and reducing word error rates in zero-shot TTS models.
Language-Codec: Bridging Discrete Codec Representations and Speech LLMs
Introduction
The paper "Language-Codec: Bridging Discrete Codec Representations and Speech LLMs" presents an innovative approach called Language-Codec, which aims to ameliorate the gaps between discrete acoustic codec representations and speech LLMs. Discrete codecs have become pivotal in high-quality generative tasks involving speech and audio, replacing traditional mel-spectrograms with more compact representations. However, discrepancies arise due to limited training datasets and inefficient codec designs. Language-Codec proposes the Masked Channel Residual Vector Quantization (MCRVQ) mechanism coupled with enhanced training paradigms to bolster compatibility with downstream speech models.
Figure 1: The overall architecture of Language-Codec. On the far left is the encoder downsampling module, which retains the model structure of Encodec. On the far right is the decoder upsampling module, which is replaced with Vocos' model structure. The middle part is the Masked Channel Residual Vector Quantization module, with the gray blocks indicating the masked portions of temporal information.
Framework and Architecture
The core innovation of Language-Codec lies in its architectural adjustments and training refinements. The encoder, derived from Encodec, downsamples audio inputs into a latent space; the key contribution, the MCRVQ module described below, then spreads information across codec channels to mitigate the overload of the initial channels. The decoder, by contrast, adopts Vocos' Fourier-based upsampling structure: rather than upsampling with transposed convolutions, it predicts frequency-domain coefficients and reconstructs the waveform with an inverse short-time Fourier transform.
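To make the decoder-side change concrete, here is a minimal sketch of a Vocos-style inverse-STFT head. The class name, layer sizes, and projection layout are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Sketch of a Vocos-style decoder head: project latent frames to
    log-magnitude and phase spectra, then invert with a single iSTFT.
    Hyperparameters here are illustrative assumptions."""

    def __init__(self, dim=512, n_fft=1024, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Each latent frame maps to (n_fft//2 + 1) log-magnitudes plus
        # (n_fft//2 + 1) phases, hence n_fft + 2 outputs.
        self.proj = nn.Linear(dim, n_fft + 2)

    def forward(self, z):                    # z: (B, T, dim) quantized latents
        mag, phase = self.proj(z).chunk(2, dim=-1)
        spec = torch.polar(torch.exp(mag), phase)   # complex STFT frames
        window = torch.hann_window(self.n_fft, device=z.device)
        return torch.istft(spec.transpose(1, 2), self.n_fft,
                           hop_length=self.hop, window=window)
```

A single inverse transform replaces a stack of learned transposed convolutions, which is the main reason Vocos-style decoders are fast at inference time.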
MCRVQ uses a hybrid parallel-serial quantization scheme that enforces channel masking in the initial quantization stages. This distributes audio information more uniformly across quantizers, reducing dependency on the first codec layers and simplifying downstream tasks such as speech synthesis. Beyond MCRVQ, the framework pairs the Fourier-domain decoder with robust discriminator designs and trains on extensive data covering roughly 50,000 hours of speech.
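As a rough illustration of the idea, the sketch below applies time-frame masking to the first few quantizers of a residual VQ, so that on masked frames the residual passes untouched to later quantizers and information is pushed into later channels. The class name, `keep_ratio`, and the masking schedule are hypothetical simplifications, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class MaskedChannelRVQ(nn.Module):
    """Sketch of residual VQ with masked initial channels (assumed design)."""

    def __init__(self, n_quantizers=8, n_masked=3, dim=128,
                 codebook_size=1024, keep_ratio=0.25):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_quantizers))
        self.n_masked, self.keep_ratio = n_masked, keep_ratio

    def nearest(self, x, codebook):
        # Nearest-neighbour lookup: x is (B, T, D), codebook.weight is (K, D).
        B, T, D = x.shape
        dists = torch.cdist(x.reshape(-1, D), codebook.weight)   # (B*T, K)
        idx = dists.argmin(dim=-1).view(B, T)
        return codebook(idx), idx

    def forward(self, x):                     # x: (B, T, D) encoder output
        residual, quantized, codes = x, torch.zeros_like(x), []
        for i, cb in enumerate(self.codebooks):
            q, idx = self.nearest(residual, cb)
            if self.training and i < self.n_masked:
                # Mask most time frames for the first quantizers: where masked,
                # this quantizer contributes nothing and the full residual
                # flows on to later channels.
                keep = (torch.rand(x.shape[:2], device=x.device) < self.keep_ratio)
                q = q * keep.unsqueeze(-1).float()
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        return quantized, torch.stack(codes, dim=1)   # codes: (B, n_q, T)
```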
Experimental Results
Evaluation Metrics
Performance evaluations were conducted on the standard LibriTTS and LJSpeech datasets. Key metrics included UTMOS, PESQ, STOI, and speaker similarity, giving a comprehensive picture of reconstruction quality and codec efficacy. Evaluations targeted low bitrates with few quantizer channels to reflect practical deployment scenarios.
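For reference, PESQ and STOI can be computed with the widely used `pesq` and `pystoi` Python packages (UTMOS, in contrast, comes from a pretrained neural MOS predictor and is not shown here). A minimal sketch, assuming 16 kHz mono reference and reconstructed signals:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def objective_scores(ref: np.ndarray, rec: np.ndarray, fs: int = 16000) -> dict:
    """Score a reconstructed waveform against its reference.
    Both inputs are 1-D float arrays at the same sampling rate."""
    return {
        "pesq": pesq(fs, ref, rec, "wb"),            # wideband PESQ requires fs == 16000
        "stoi": stoi(ref, rec, fs, extended=False),  # classic short-time intelligibility
    }
```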
The empirical results favor Language-Codec. Across both LibriTTS and LJSpeech, it outperforms baselines such as Encodec and Vocos on UTMOS, PESQ, and STOI. Notably, in the noisier LibriTTS Test-Other condition, Language-Codec maintains robust reconstruction fidelity, indicating strong generalization.
Downstream Applications
For zero-shot text-to-speech (TTS), both VALL-E and MobileSpeech benefited from Language-Codec's representations, as evidenced by higher speaker similarity and lower word error rates (WER). In particular, MobileSpeech achieved superior MOS-Q and MOS-S scores, affirming the codec's adaptability across TTS architectures.
Future Directions and Implications
The successful integration of the MCRVQ mechanism demonstrates a viable path for closing the information gap between discrete codecs and speech models. Future work could explore adaptive codebook resizing and multi-scale quantization to further refine codec efficacy in dynamically varying linguistic contexts.
The practical implications are significant, as Language-Codec provides a robust, scalable solution for high-quality speech generation and comprehension tasks, potentially elevating the standards for interactive voice systems and media communications.
Conclusion
Language-Codec marks a significant step in discrete codec development. By addressing how information is distributed across codec channels and leveraging comprehensive training datasets, it sets a new benchmark in speech and audio reconstruction. As low-bitrate, high-quality codecs become pivotal to multimedia applications, Language-Codec's insights will inform future advances in AI-driven audio synthesis.