VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers (2406.05370v2)

Published 8 Jun 2024 in cs.CL, cs.SD, and eess.AS

Abstract: This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. See https://aka.ms/valle2 for demos of VALL-E 2.

Authors (9)

Sanyuan Chen
Shujie Liu
Long Zhou
Yanqing Liu
Xu Tan
Jinyu Li
Sheng Zhao
Yao Qian
Furu Wei

Summary

The paper introduces VALL-E 2, achieving human parity in zero-shot TTS by incorporating Repetition Aware Sampling and Grouped Code Modeling.
It reduces sequence lengths and improves decoding stability to deliver superior performance on LibriSpeech and VCTK benchmarks with lower WER and enhanced DNSMOS and SMOS scores.
The model paves the way for practical applications, from accessible communication aids to advanced virtual assistants, setting new standards for future TTS research.

Overview of VALL-E 2: Achieving Human Parity in Zero-Shot Text-to-Speech Synthesis

The paper "VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers" introduces VALL-E 2, a neural codec language model that reaches human parity in zero-shot text-to-speech (TTS) synthesis. It builds on its predecessor, VALL-E, and adds two enhancements that improve decoding stability and modeling efficiency.

Key Innovations

VALL-E 2 makes two modifications to its predecessor, VALL-E, to improve performance and efficiency:

Repetition Aware Sampling (RAS):
- RAS refines nucleus sampling by accounting for token repetition in the decoding history. It switches between nucleus sampling and random sampling depending on how often a token repeats, which stabilizes decoding and avoids the infinite loops encountered with VALL-E.
Grouped Code Modeling (GCM):
- GCM organizes codec codes into groups, effectively reducing the sequence length. This modification not only accelerates inference but also alleviates issues related to long sequence modeling, thereby improving overall performance.
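
The effect of grouping on sequence length can be sketched in a few lines of Python. This is an illustration only: the function names `group_codes` and `ungroup_codes`, the padding id, and the choice to pad the final group are assumptions, and the actual model handles codec codes inside the network rather than as plain lists.

```python
from typing import List


def group_codes(codes: List[int], group_size: int, pad_id: int = -1) -> List[List[int]]:
    """Partition a flat code sequence into consecutive groups of `group_size`.

    The last group is padded with `pad_id` (an assumed convention) so that
    every group has the same length.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    padding = [pad_id] * (-len(codes) % group_size)
    padded = codes + padding
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def ungroup_codes(groups: List[List[int]], pad_id: int = -1) -> List[int]:
    """Flatten groups back into a single sequence, dropping padding."""
    return [code for group in groups for code in group if code != pad_id]


# A hypothetical 750-step code sequence modeled with a group size of 2.
codes = list(range(750))
groups = group_codes(codes, group_size=2)
print(len(codes), "->", len(groups))
```

With a group size of 2, the 750-step sequence is modeled in 375 autoregressive steps, which is where the shorter sequence and the faster inference come from.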

Experimental Findings

Evaluation on the LibriSpeech and VCTK datasets shows that VALL-E 2 surpasses prior systems in robustness, naturalness, and speaker similarity. The model reaches human parity on these benchmarks, as measured by both objective and subjective metrics.

LibriSpeech Results:

Objective Metrics: VALL-E 2 showed a significant reduction in Word Error Rate (WER) and improvements in DNSMOS scores. In specific settings its WER was even lower than that of the ground truth speech, underscoring its robustness.
Subjective Metrics: The model surpassed VALL-E in both Similarity Mean Opinion Score (SMOS) and Comparative Mean Opinion Score (CMOS), indicating better speaker similarity and naturalness.
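
For readers unfamiliar with WER, the metric is the word-level edit distance (substitutions, deletions, and insertions) between a reference text and a transcript, divided by the number of reference words. The Python below is a minimal sketch of that formula. In TTS evaluation the transcript usually comes from running a speech recognizer on the synthesized audio; that step is omitted here, and the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value means the synthesized speech was transcribed more faithfully, which is why WER serves as a proxy for robustness.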

VCTK Results:

Objective Metrics: Similar trends were observed, with VALL-E 2 significantly lowering WER and improving DNSMOS scores across prompt lengths of 3s, 5s, and 10s.
Subjective Metrics: VALL-E 2 demonstrated superior performance over VALL-E and achieved scores comparable to or even surpassing ground truth speech.

Implications and Future Developments

VALL-E 2 has both practical and theoretical implications for AI research. On a practical level, it can lead to TTS systems that generate natural, human-like speech for previously unseen speakers from minimal enrollment data. Such systems could be invaluable in applications ranging from aiding individuals with speech impairments to enhancing virtual assistants and communication aids.

Theoretically, VALL-E 2's success in reducing sequence lengths and improving decoding techniques sets a new benchmark for future TTS model development. The introduction of RAS and GCM demonstrates ways to balance stability and efficiency in autoregressive models, providing a blueprint for addressing similar challenges in other neural language modeling tasks.
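
As a minimal sketch of the stability side of that balance, the Python below implements the RAS idea summarized under Key Innovations: draw a token with nucleus sampling, and if that token already takes up too large a share of the recent decoding history, redraw it from the full distribution. The function names, the window size, the repetition threshold, and the top-p value are illustrative assumptions, not values taken from the paper.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token in order:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


def random_sample(probs: Sequence[float], rng: random.Random) -> int:
    """Sample from the full distribution, with no truncation."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: List[int],
    rng: random.Random,
    top_p: float = 0.8,      # assumed value
    window: int = 10,        # assumed value
    threshold: float = 0.1,  # assumed value
) -> int:
    """Nucleus-sample a token; fall back to random sampling if it repeats too often."""
    if window < 1:
        raise ValueError("window must be at least 1")
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    repetition_ratio = recent.count(token) / window
    if repetition_ratio > threshold:
        token = random_sample(probs, rng)
    return token


rng = random.Random(0)
probs = [0.9, 0.05, 0.05]
# With an empty history the nucleus contains only token 0, so it is always chosen.
print(repetition_aware_sample(probs, [], rng))
# With a history full of token 0, the fallback lets other tokens through.
print({repetition_aware_sample(probs, [0] * 10, rng) for _ in range(500)})
```

The fallback is what breaks a loop: once one token dominates the recent window, sampling from the full distribution gives lower-probability tokens a chance to be emitted.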

Conclusion

VALL-E 2 marks a significant evolution in zero-shot TTS synthesis, achieving human parity through thoughtful advancements in sampling and modeling techniques. As researchers continue to explore and refine these innovations, the implications for AI-driven communication devices and accessibility technologies are vast. This work not only sets a new standard for speech synthesis but also opens up promising avenues for further research and application in human-computer interaction.
