
Translatotron 2: High-quality direct speech-to-speech translation with voice preservation (2107.08661v5)

Published 19 Jul 2021 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a linguistic decoder, an acoustic synthesizer, and a single attention module that connects them together. Experimental results on three datasets consistently show that Translatotron 2 outperforms the original Translatotron by a large margin on both translation quality (up to +15.5 BLEU) and speech generation quality, and approaches the same of cascade systems. In addition, we propose a simple method for preserving speakers' voices from the source speech to the translation speech in a different language. Unlike existing approaches, the proposed method is able to preserve each speaker's voice on speaker turns without requiring for speaker segmentation. Furthermore, compared to existing approaches, it better preserves speaker's privacy and mitigates potential misuse of voice cloning for creating spoofing audio artifacts.

Citations (59)

Summary

  • The paper presents a direct S2ST model that integrates a Conformer speech encoder, an LSTM-based linguistic decoder, and a duration-based acoustic synthesizer through a single shared attention module.
  • It introduces a voice preservation technique that trains on data synthesized with cross-lingual TTS voice transfer, so each speaker's voice carries over to the translation without a speaker embedding at inference.
  • Experiments on three datasets show gains of up to +15.5 BLEU over the original Translatotron, with translation and speech quality approaching that of cascade systems.

Translatotron 2: Direct Speech-to-Speech Translation with Voice Preservation

Introduction

Translatotron 2 represents a significant advancement in the domain of direct speech-to-speech translation (S2ST), a technology designed to overcome linguistic barriers by converting spoken words from one language to another without intermediate text representation. Historically, S2ST systems have relied on a cascade architecture comprising automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. Translatotron 2 challenges this paradigm by offering a direct S2ST approach, improving upon its predecessor in both translation accuracy and speech quality. The model integrates a speech encoder, linguistic decoder, and acoustic synthesizer connected by a single attention module.

Methodology

Model Architecture

Translatotron 2 was designed to address the limitations of the original Translatotron. The architecture consists of a speech encoder built from Conformer blocks, a linguistic decoder based on LSTM layers that predicts the target phoneme sequence, and a duration-based acoustic synthesizer that generates the target spectrogram. A single attention module connects the encoder to both the linguistic decoder and the acoustic synthesizer, simplifying the alignment between source and target speech.

Figure 1: Overview of Translatotron 2.
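The role of the single shared attention module can be illustrated with a minimal scaled dot-product sketch (NumPy; the names, shapes, and the attention variant are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def scaled_dot_attention(queries, keys, values):
    """Scaled dot-product attention: one context vector per query."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (T_q, T_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over source frames
    return weights @ values, weights                 # contexts, alignment

rng = np.random.default_rng(0)
enc = rng.normal(size=(50, 16))    # encoder outputs: 50 source frames
dec_q = rng.normal(size=(7, 16))   # linguistic decoder states: 7 phoneme steps

# The linguistic decoder attends over the encoder output once...
context, align = scaled_dot_attention(dec_q, enc, enc)

# ...and the same per-phoneme context vectors are then passed on to the
# acoustic synthesizer, instead of the synthesizer running a second
# attention over the source speech.
synth_input = np.concatenate([context, dec_q], axis=-1)  # (7, 32)
print(context.shape, synth_input.shape)
```

Sharing one attention this way is what keeps the alignment between source speech and target speech consistent across the linguistic and acoustic stages.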

Voice Preservation

Translatotron 2 preserves each speaker's voice during translation while addressing the privacy and misuse concerns associated with voice cloning. Instead of conditioning on a speaker embedding at inference time, the model is trained on parallel utterance pairs synthesized with consistent voice characteristics across languages, produced by a TTS model capable of cross-lingual voice transfer. Because the trained model accepts no separate speaker reference, it cannot be repurposed to clone an arbitrary voice for spoofing.

Figure 2: Sample mel-spectrograms on input with speaker turns. Translatotron 2 preserves the voice of each speaker in the translation speech.

Experimental Evaluation

Translation Quality and Speech Robustness

Experiments on the Fisher Spanish-English and CoVoST 2 datasets show that Translatotron 2 outperforms its predecessor by a wide margin, with BLEU improvements of up to +15.5, and approaches the translation quality of cascade S2ST systems. The model also largely eliminates over-generation errors such as babbling, reaching robustness comparable to well-established cascade models.
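Because a direct S2ST model outputs audio rather than text, BLEU is computed on ASR transcripts of the translated speech. A simplified sentence-level BLEU can be sketched as follows (real evaluations use corpus-level tooling such as SacreBLEU; this smoothed single-sentence version is only for illustration):

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())         # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smoothed to avoid log(0)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# In S2ST evaluation: ASR-transcribe the translated audio, then score
# that transcript against the reference translation.
hyp = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(round(bleu(hyp, ref), 3))  # → 0.537
```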

Natural Speech Generation

Subjective listening tests affirm the high naturalness of generated speech, substantiating Translatotron 2’s capability to synthesize audio that aligns closely with human-like speech quality, an essential requirement for practical deployment.

Multilingual Application

The model generalizes to multilingual S2ST, delivering consistent translation-quality gains across languages. A simple augmentation technique, ConcatAug, randomly concatenates pairs of training examples (source speech, target speech, and target phoneme sequences), so the model learns to handle inputs that contain speaker turns.
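ConcatAug amounts to a few array concatenations per example pair. A minimal sketch (the tuple layout and cyclic pairing are assumptions for determinism; the paper samples pairs randomly):

```python
import numpy as np

def concat_aug(batch):
    """ConcatAug-style augmentation (sketch): concatenate each example
    with another one, so training inputs contain a speaker turn.
    Each example is (source speech, target speech, target phonemes)."""
    out = []
    for i, (src_a, tgt_a, ph_a) in enumerate(batch):
        src_b, tgt_b, ph_b = batch[(i + 1) % len(batch)]  # cyclic neighbor
        out.append((np.concatenate([src_a, src_b]),   # source speech
                    np.concatenate([tgt_a, tgt_b]),   # target speech
                    ph_a + ph_b))                     # target phonemes
    return out

# Toy examples: 1-D "waveforms" of different lengths plus phoneme lists.
batch = [(np.zeros(100), np.zeros(120), ["HH", "AH"]),
         (np.ones(80), np.ones(90), ["B", "AY"])]
aug = concat_aug(batch)
print(aug[0][0].shape, len(aug[0][2]))  # → (180,) 4
```

Since the concatenated pair usually comes from two different speakers, the model sees voice changes mid-utterance during training, which is what lets it preserve each voice across speaker turns without explicit speaker segmentation.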

Figure 3: Affinity matrices of d-vector similarity among 100 random examples. Predictions from Translatotron 2 demonstrate clear preservation of speaker characteristics.
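The affinity analysis in Figure 3 can be reproduced in miniature: compute speaker d-vectors for each utterance and take pairwise cosine similarities, so same-speaker pairs form bright blocks. A sketch with toy d-vectors (a real analysis would use embeddings from a trained speaker encoder):

```python
import numpy as np

def affinity_matrix(d_vectors):
    """Cosine-similarity affinity matrix among speaker d-vectors:
    entry (i, j) is high when utterances i and j sound alike."""
    x = d_vectors / np.linalg.norm(d_vectors, axis=1, keepdims=True)
    return x @ x.T

rng = np.random.default_rng(0)
# Toy d-vectors: utterances 0-1 from one "speaker", 2-3 from another.
spk_a, spk_b = rng.normal(size=16), rng.normal(size=16)
d = np.stack([spk_a + 0.1 * rng.normal(size=16),
              spk_a + 0.1 * rng.normal(size=16),
              spk_b + 0.1 * rng.normal(size=16),
              spk_b + 0.1 * rng.normal(size=16)])
A = affinity_matrix(d)
print(np.round(A, 2))  # block structure: same-speaker pairs near 1.0
```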

Conclusion

Translatotron 2 exemplifies significant advancements in direct S2ST, offering improved translation accuracy, naturalness, and voice preservation capabilities. Its innovative architecture addresses critical performance gaps and introduces robust solutions to privacy and misuse challenges. The versatility across languages supports a wide range of applications, paving the way for future developments in more inclusive and efficient language translation technologies.
