- The report presents systems built on discrete speech tokens that topped the TTS track with high UTMOS scores at the lowest bitrate among all submissions.
- Semantic and acoustic tokens are combined in track-specific architectures: a conformer-based acoustic model for TTS, a non-autoregressive VALL-E variant for SVS, and a Zipformer transducer for ASR.
- Tailored token strategies reduce error rates and improve naturalness, offering actionable insights for future speech-processing research.
The X-LANCE Technical Report for Interspeech 2024: Speech Processing Using Discrete Speech Unit Challenge
Introduction
The report details the contributions of the SJTU X-LANCE team to the Interspeech 2024 challenge, focusing on discrete speech tokens, which have gained prominence across several domains including ASR, TTS, and SVS. The team's systems achieved notable results, particularly in the TTS track, where they topped the leaderboard with high UTMOS scores at the lowest bitrate across all submissions.
TTS Track: The VQTTS System
The TTS track required efficient use of discrete tokens to achieve a low bitrate while maintaining high naturalness. The team combined wav2vec 2.0 semantic tokens with FunCodec acoustic tokens in a modified VQTTS pipeline.
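To make the bitrate criterion concrete, a token stream's bitrate follows directly from its frame rate, the number of parallel codebooks, and the codebook size; the numbers below are hypothetical placeholders rather than the submitted system's actual configuration.

```python
import math

# Hypothetical token-stream settings (illustrative only, not the submitted system's values).
frame_rate_hz = 50        # discrete tokens emitted per second
num_codebooks = 1         # parallel codebook streams per frame
codebook_size = 1024      # entries per codebook

# Each token carries log2(codebook_size) bits, so:
bitrate_bps = frame_rate_hz * num_codebooks * math.log2(codebook_size)
print(f"{bitrate_bps:.0f} bits/s")  # 500 bits/s for these example numbers
```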
Discrete Tokens
Semantic and acoustic tokens were combined to capture both linguistic content and signal-reconstruction detail. Wav2vec 2.0 provided the semantic tokens, while FunCodec supplied the acoustic tokens at a significantly reduced bitrate.
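The summary above does not spell out how continuous wav2vec 2.0 features become discrete semantic tokens; a common recipe is k-means clustering over a chosen hidden layer, sketched below. The checkpoint name, layer index, cluster count, and the `train_waveforms`/`test_waveform` variables are assumptions for illustration, not the team's settings.

```python
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Hypothetical checkpoint and hyperparameters (assumptions, not the report's configuration).
MODEL_NAME = "facebook/wav2vec2-base"
LAYER = 6           # hidden layer to discretize
N_CLUSTERS = 512    # k-means codebook size

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def hidden_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Return frame-level wav2vec 2.0 features, shape (frames, dim)."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    return out.hidden_states[LAYER].squeeze(0)

# `train_waveforms` / `test_waveform` are assumed 16 kHz mono tensors.
# Fit a codebook on pooled training features, then map each frame of a new
# utterance to its nearest cluster index to obtain the semantic token sequence.
train_feats = torch.cat([hidden_features(w) for w in train_waveforms]).numpy()
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=4, random_state=0).fit(train_feats)
semantic_tokens = kmeans.predict(hidden_features(test_waveform).numpy())
```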
Model Architecture
The acoustic model used a conformer-based encoder, a phoneme-level prosody controller for precise prosody modeling, and a fixed-step causal VQ decoder. The vocoder followed the CTX-vec2wav design, refined for single-speaker scenarios.
Figure 1: Architecture of the acoustic model (left) and vocoder (right) in the TTS track.
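As a structural sketch only, the modules described above can be arranged roughly as follows; the layer sizes, the LSTM stand-in for the causal VQ decoder, and the batch-size-1 duration handling are simplifying assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class SketchAcousticModel(nn.Module):
    """Structural sketch of a VQTTS-style acoustic model: phoneme encoder ->
    phoneme-level prosody conditioning -> frame-level discrete-token prediction.
    All sizes are illustrative placeholders."""

    def __init__(self, n_phones=100, n_tokens=512, d_model=256, n_prosody=32):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.encoder = Conformer(input_dim=d_model, num_heads=4, ffn_dim=1024,
                                 num_layers=4, depthwise_conv_kernel_size=31)
        # Phoneme-level prosody labels are embedded and added to the encoder output.
        self.prosody_emb = nn.Embedding(n_prosody, d_model)
        # Simplified stand-in for the causal VQ decoder: an LSTM over frame features.
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_tokens)

    def forward(self, phones, phone_lens, prosody, durations):
        # phones: (1, P) ids, phone_lens: (1,), prosody: (1, P) ids, durations: (P,) frames.
        x = self.phone_emb(phones)                    # (1, P, D)
        x, _ = self.encoder(x, phone_lens)            # conformer encoding
        x = x + self.prosody_emb(prosody)             # phoneme-level prosody control
        frames = torch.repeat_interleave(x, durations, dim=1)  # upsample to frame level
        h, _ = self.decoder(frames)
        return self.out(h)                            # (1, T, n_tokens) token logits
```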
Experimental Results
Integrating aligned phonemes improved performance markedly with wav2vec 2.0 tokens, reducing WER and raising UTMOS scores. FunCodec tokens, although less effective in resynthesis WER, proved more stable when less training data was available.
SVS Track: The DOVSinger System
The SVS track demanded high-fidelity singing synthesis from discrete tokens. The system was built around Descript Audio Codec (DAC) tokens and a modified VALL-E pipeline.
Discrete Tokens
DAC was chosen for its ability to compress high-fidelity audio effectively while retaining the nuances of the singing voice dataset.
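DAC, like other neural codecs, relies on residual vector quantization: each codebook layer quantizes what the previous layers failed to capture, which is why using fewer layers trades fidelity for bitrate. The sketch below illustrates that idea generically and is not DAC's actual implementation.

```python
import torch

def residual_vq_encode(latents, codebooks):
    """Residual vector quantization as used in neural codecs such as DAC:
    each codebook quantizes the residual left by the previous layers.
    latents: (T, D) frame latents; codebooks: list of (K, D) tensors.
    Purely illustrative."""
    residual = latents
    codes, quantized = [], torch.zeros_like(latents)
    for cb in codebooks:
        dists = torch.cdist(residual, cb)     # (T, K) distances to codebook entries
        idx = dists.argmin(dim=1)             # nearest entry per frame
        q = cb[idx]                           # (T, D) quantized residual
        codes.append(idx)
        quantized = quantized + q
        residual = residual - q               # pass the remainder to the next layer
    return torch.stack(codes), quantized      # (n_layers, T), (T, D)

# Example: 8 layers of 1024-entry codebooks over 64-dim latents.
T, D, K, L = 200, 64, 1024, 8
codes, approx = residual_vq_encode(torch.randn(T, D),
                                   [torch.randn(K, D) for _ in range(L)])
```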
Model Architecture
The modified VALL-E framework involved a purely non-autoregressive inference pipeline, optimized to predict DAC codes across multiple layers with a focus on temporal coherence.
Figure 2: Architecture of the acoustic model (left) and vocoder (right) in the SVS track.
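The layer-by-layer, non-autoregressive prediction idea can be sketched as follows: given the conditioning and the codec layers already fixed, the model predicts the next layer's codes for all frames in parallel. Module names, sizes, and the greedy argmax decoding are illustrative assumptions, not the team's exact design.

```python
import torch
import torch.nn as nn

class NARCodePredictor(nn.Module):
    """Schematic of a VALL-E-style non-autoregressive stage: predict the codes of
    codec layer k for all frames at once, conditioned on layers 0..k-1.
    Sizes are placeholders."""

    def __init__(self, n_layers=8, n_codes=1024, d_model=256):
        super().__init__()
        self.n_layers = n_layers
        self.code_emb = nn.ModuleList([nn.Embedding(n_codes, d_model) for _ in range(n_layers)])
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.heads = nn.ModuleList([nn.Linear(d_model, n_codes) for _ in range(n_layers)])

    @torch.no_grad()
    def infer(self, cond, first_layer_codes):
        """cond: (B, T, D) conditioning; first_layer_codes: (B, T) layer-0 codes."""
        codes = [first_layer_codes]
        for layer in range(1, self.n_layers):
            # Sum the embeddings of all layers predicted so far; no left-to-right loop over time.
            x = cond + sum(self.code_emb[l](codes[l]) for l in range(layer))
            h = self.backbone(x)
            codes.append(self.heads[layer](h).argmax(dim=-1))   # (B, T) greedy codes
        return torch.stack(codes, dim=1)                        # (B, n_layers, T)
```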
Experimental Results
DOVSinger achieved high pitch accuracy at a reduced bitrate, but perceptual quality suffered because only a limited number of DAC layers was used in synthesis, producing noticeable hoarseness in the output vocals.
ASR Track
For the ASR track, the focus was on using semantic tokens for recognizing speech at a low bitrate. The WavLM model was pivotal in providing the discrete tokens.
Using WavLM-large features discretized with k-means clustering, the team built an end-to-end ASR system trained with the RNN-T loss and robust token augmentation techniques.
Figure 3: Illustration of the pipeline for speech discrete tokens in the ASR track.
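The exact token augmentation is not detailed here; one simple, hypothetical form is random substitution and masking of the discrete token IDs before they enter the recognizer, as sketched below.

```python
import torch

def augment_tokens(tokens, vocab_size, sub_prob=0.05, mask_prob=0.05, mask_id=0):
    """Simple discrete-token augmentation: randomly substitute some tokens with
    random IDs and mask others. Illustrative assumption only; the team's actual
    augmentation may differ. tokens: (T,) LongTensor of cluster IDs."""
    tokens = tokens.clone()
    sub = torch.rand_like(tokens, dtype=torch.float) < sub_prob
    tokens[sub] = torch.randint(vocab_size, (int(sub.sum()),))
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    tokens[mask] = mask_id
    return tokens

# Example: perturb a sequence of WavLM k-means indices before feeding the transducer.
aug = augment_tokens(torch.randint(2000, (300,)), vocab_size=2000)
```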
Model Architecture and Training
The neural transducer architecture paired a Zipformer encoder for efficient sequence encoding with a stateless prediction network, following the RNN-T framework.
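As a minimal illustration of the RNN-T objective the system is trained with, the snippet below shows the tensor shapes expected by torchaudio's rnnt_loss; the team's actual toolkit and joiner design may differ, and the logits here are random stand-ins for the encoder and prediction-network outputs.

```python
import torch
import torchaudio.functional as F_audio

# Toy shapes: B utterances, T encoder frames, U target tokens, V vocab (incl. blank).
B, T, U, V = 2, 50, 10, 500
blank = 0

# In a transducer, the joiner combines encoder frames and prediction-network states
# into logits over (frame, token-position) pairs; random stand-ins are used here.
logits = torch.randn(B, T, U + 1, V, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = F_audio.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=blank)
loss.backward()
```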
Experimental Results
The system achieved a substantial reduction in CER, demonstrating its effectiveness for speech recognition, albeit at a somewhat higher bitrate than the baseline.
Conclusion
The X-LANCE team's participation in the Interspeech 2024 challenge demonstrated a comprehensive understanding and application of discrete speech tokens across multiple domains. Their systems not only delivered competitive performance but also shed light on effective token-usage strategies that may influence future developments in speech-processing technologies.