Direct speech-to-speech translation with a sequence-to-sequence model (1904.06037v2)

Published 12 Apr 2019 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.


Summary

The paper introduces Translatotron, an end-to-end sequence-to-sequence model that eliminates intermediate text representations for direct speech translation.
It pairs a multi-layer bidirectional LSTM encoder with an attention-based spectrogram decoder, and uses auxiliary recognition tasks during training to help the model learn alignments.
Experiments on two Spanish-to-English corpora show that direct translation is feasible and that the source speaker's voice can be carried into the translated speech, although BLEU scores remain below those of cascaded systems.

Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model

This paper introduces Translatotron, a single sequence-to-sequence neural network for speech-to-speech translation (S2ST). The model translates speech in one language directly into speech in another without an intermediate text representation, in contrast to conventional systems that chain separate automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) stages. Training the whole network end to end is intended to avoid the compounding errors and added latency of such cascades.

Model Architecture and Training

Translatotron is an attention-based sequence-to-sequence model that maps input spectrograms in one language directly to output spectrograms in another. The architecture has three main components: a multi-layer bidirectional LSTM encoder, several auxiliary decoder networks for recognition tasks, and a single primary spectrogram decoder that uses multi-head attention. The primary decoder generates log spectrograms of the translated speech, which a vocoder (Griffin-Lim or WaveRNN, depending on the evaluation) then converts to a time-domain waveform.
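To make the attention step concrete, the following is a minimal sketch of how one decoder step might summarize the encoder outputs with several attention heads. It is an illustration under simplifying assumptions, not the paper's implementation: it uses scaled dot-product scoring, gives each head a fixed slice of the feature dimensions instead of learned projections, and the function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(query, encoder_frames, num_heads):
    """Return (context, per-head weights) for one decoder step.

    query: decoder state, a list of d floats.
    encoder_frames: T encoder outputs, each a list of d floats.
    Assumption: each head attends over its own contiguous slice of the
    d dimensions (no learned projection matrices).
    """
    dim = len(query)
    assert dim % num_heads == 0, "dimension must split evenly across heads"
    head_dim = dim // num_heads
    context, all_weights = [], []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        keys = [frame[lo:hi] for frame in encoder_frames]
        # Scaled dot-product score between the query and every input frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                  for k in keys]
        weights = softmax(scores)
        # The weighted sum of encoder slices is this head's context vector.
        context.extend(sum(w * k[i] for w, k in zip(weights, keys))
                       for i in range(head_dim))
        all_weights.append(weights)
    return context, all_weights


# Toy example: three encoder frames of dimension 4, two heads.
encoder_frames = [[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0, 0.0],
                  [0.5, 0.5, 0.5, 0.5]]
query = [2.0, 0.0, 0.0, 2.0]
context, weights = multi_head_attention(query, encoder_frames, num_heads=2)
```

Each row of `weights` is a distribution over input frames, which is the soft alignment between source and target speech that the decoder has to learn.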

A notable aspect of this work is the use of auxiliary recognition tasks to help the model learn effective alignments during training. The auxiliary decoders predict phoneme sequences for both the source and target utterances, providing supervision that helps the attention mechanism learn robust alignments. No text-based intermediate representation is used during inference, so the model remains end-to-end at translation time.
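The multitask objective described above can be sketched as a spectrogram loss plus weighted auxiliary phoneme losses. This is a hedged illustration only: the use of mean squared error, the `aux_weight` parameter, and all function names are assumptions made for the example, and the paper's actual loss terms and weights are not reproduced here.

```python
import math


def spectrogram_loss(predicted, target):
    """Mean squared error over all frames and frequency bins (assumed loss)."""
    total, count = 0.0, 0
    for pred_frame, tgt_frame in zip(predicted, target):
        for p, t in zip(pred_frame, tgt_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def phoneme_loss(step_probs, phoneme_ids):
    """Mean cross-entropy of the reference phonemes under predicted distributions."""
    nll = [-math.log(probs[idx]) for probs, idx in zip(step_probs, phoneme_ids)]
    return sum(nll) / len(nll)


def training_loss(spec_pred, spec_tgt, src_probs, src_ids, tgt_probs, tgt_ids,
                  aux_weight=1.0, use_auxiliary=True):
    """Primary spectrogram loss, optionally plus both auxiliary phoneme losses."""
    loss = spectrogram_loss(spec_pred, spec_tgt)
    if use_auxiliary:
        aux = phoneme_loss(src_probs, src_ids) + phoneme_loss(tgt_probs, tgt_ids)
        loss += aux_weight * aux  # aux_weight is a hypothetical hyperparameter
    return loss


# Toy values: two spectrogram frames with two bins, two phoneme steps per decoder.
spec_pred = [[0.0, 1.0], [1.0, 0.0]]
spec_tgt = [[0.0, 1.0], [1.0, 1.0]]
src_probs, src_ids = [[0.5, 0.5], [0.25, 0.75]], [0, 1]
tgt_probs, tgt_ids = [[0.9, 0.1], [0.2, 0.8]], [0, 1]

with_aux = training_loss(spec_pred, spec_tgt, src_probs, src_ids, tgt_probs, tgt_ids)
without_aux = training_loss(spec_pred, spec_tgt, src_probs, src_ids, tgt_probs, tgt_ids,
                            use_auxiliary=False)
```

Setting `use_auxiliary=False` mirrors inference and the no-auxiliary ablation, where only the spectrogram branch is active and the phoneme decoders play no role.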

Experimental Results and Analysis

The researchers evaluate Translatotron on two datasets: a large-scale proprietary Spanish-to-English corpus originally designed for text translation, and the Fisher Spanish corpus of spontaneous telephone conversations. In both cases, the model falls short of the baseline cascade (a direct speech-to-text translation model followed by a text-to-speech model), as measured by BLEU. Nonetheless, the results demonstrate that the approach is feasible and highlight areas that warrant further exploration.
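Because the model outputs audio, BLEU has to be computed on text transcripts of the generated speech compared against reference translations. The sketch below shows the metric itself under simplifying assumptions: whitespace tokenization, one reference per hypothesis, no smoothing, and transcripts supplied directly as strings. The function names and sample sentences are hypothetical, and published scores would normally come from a standard BLEU tool.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1], one reference per hypothesis, no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty for hypotheses shorter than their references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
exact = corpus_bleu(["the cat sat on the mat"], reference)
partial = corpus_bleu(["the cat sat on a mat"], reference)
disjoint = corpus_bleu(["completely different words here now"], reference)
```

An exact match scores 1.0, a hypothesis with no matching 4-gram scores 0.0, and a near miss falls in between. Note that transcription errors made when converting the generated speech to text would also lower the score, independent of translation quality.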

The authors also show that translated speech can be synthesized in the voice of the source speaker by conditioning the model on the output of a pre-trained speaker encoder network. This cross-language voice conversion points to applications in personalized translation, though the results show that maintaining speaker similarity remains difficult.
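One common way to condition a sequence-to-sequence model on a speaker is to append a fixed-length speaker embedding to every encoder output frame before attention. The sketch below assumes that mechanism for illustration; the L2 normalization, the function names, and the toy dimensions are assumptions, and the speaker encoder itself is treated as a black box that has already produced the embedding.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def condition_on_speaker(encoder_frames, speaker_embedding):
    """Append the same normalized speaker embedding to every encoder frame."""
    unit = l2_normalize(speaker_embedding)
    return [list(frame) + unit for frame in encoder_frames]


# Toy example: two encoder frames of dimension 3 and a 2-dimensional embedding.
encoder_frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speaker_embedding = [3.0, 4.0]  # hypothetical output of a speaker encoder
conditioned = condition_on_speaker(encoder_frames, speaker_embedding)
```

Every frame passed to the decoder then carries the same speaker information, so the decoder can vary the output voice without changing how the linguistic content is encoded.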

Further analysis shows that models trained with the auxiliary tasks perform significantly better than those trained without them, suggesting that additional supervision is needed when no intermediate text representation is available. The experiments also reveal differences between training on synthetic target speech and the cross-language voice transfer setting, which point to remaining challenges in model generalization.

Implications and Future Directions

The work has implications for both practical applications and research in machine translation and speech processing. On the practical side, a single-step S2ST model could reduce latency and computation, and could better preserve prosody and speaker characteristics in the translated speech. The paper also lays groundwork for further efforts to integrate machine translation with speech technologies where ample training data is available.

Looking ahead, several directions are identified for future research. These include leveraging weak supervision strategies to scale training data, improving voice transfer capabilities through adversarial or cycle-consistent training, and extending model capabilities to incorporate varied prosodic and acoustic elements from source speech into translations. Such advancements could propel the development of rich and expressive multilingual communication tools relevant across diverse applications.

In conclusion, the work presents a promising alternative to traditional cascaded S2ST systems. Although the results are preliminary, Translatotron is an important step toward direct S2ST and provides a foundation for further exploration and refinement.
