Direct speech-to-speech translation with a sequence-to-sequence model (1904.06037v2)

Published 12 Apr 2019 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.

Citations (208)

Summary

  • The paper introduces Translatotron as an innovative end-to-end model that directly translates spoken language without traditional intermediate steps.
  • The model uses a sequence-to-sequence architecture with attention and auxiliary phoneme prediction to enhance translation accuracy.
  • Experimental results on Spanish-English datasets validate its feasibility while highlighting challenges in speaker adaptation and performance.

Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model

Introduction

This paper presents Translatotron, a model designed for the direct translation of spoken language. Unlike traditional speech-to-speech translation systems that decompose the task into automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) components, Translatotron takes an end-to-end approach, mapping input speech spectrograms directly to output spectrograms in the target language. This design aims to avoid compounding errors across cascaded components and to retain paralinguistic features of the source speech, such as the speaker's voice and emotion.

Model Architecture

The core of Translatotron is a single attention-based sequence-to-sequence network that translates source-language speech into target-language speech. An encoder maps input spectrogram features into hidden states, an attention mechanism summarizes them at each output step, and an autoregressive decoder predicts log-spectrogram frames of the translated speech. An optional speaker embedding conditions the decoder to preserve the source speaker's voice characteristics.
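The encoder-attention-decoder loop described above can be sketched as follows. This is an illustrative toy model, not the paper's exact architecture: the class name, layer sizes, and the simple dot-product attention are all assumptions for clarity (the paper uses a deeper BLSTM stack and multi-head attention).

```python
import torch
import torch.nn as nn

class SpectrogramSeq2Seq(nn.Module):
    """Toy spectrogram-to-spectrogram seq2seq model (illustrative sketch)."""
    def __init__(self, n_mels=80, enc_dim=256, dec_dim=256, spk_dim=0):
        super().__init__()
        # Bidirectional encoder over input spectrogram frames.
        self.encoder = nn.LSTM(n_mels, enc_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Projects the decoder state into a query over encoder states.
        self.attn = nn.Linear(dec_dim, 2 * enc_dim)
        # Decoder consumes previous frame + attention context (+ optional speaker embedding).
        self.decoder_cell = nn.LSTMCell(n_mels + 2 * enc_dim + spk_dim, dec_dim)
        self.frame_proj = nn.Linear(dec_dim, n_mels)

    def forward(self, src_spec, max_out_frames, spk_emb=None):
        # src_spec: (B, T_in, n_mels)
        enc_out, _ = self.encoder(src_spec)              # (B, T_in, 2*enc_dim)
        B = src_spec.size(0)
        h = src_spec.new_zeros(B, self.decoder_cell.hidden_size)
        c = torch.zeros_like(h)
        prev_frame = src_spec.new_zeros(B, self.frame_proj.out_features)
        outputs = []
        for _ in range(max_out_frames):
            # Dot-product attention over encoder states.
            query = self.attn(h)                         # (B, 2*enc_dim)
            scores = torch.bmm(enc_out, query.unsqueeze(2)).squeeze(2)
            weights = torch.softmax(scores, dim=1)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)
            extra = [spk_emb] if spk_emb is not None else []
            dec_in = torch.cat([prev_frame, context] + extra, dim=1)
            h, c = self.decoder_cell(dec_in, (h, c))
            prev_frame = self.frame_proj(h)              # predicted spectrogram frame
            outputs.append(prev_frame)
        return torch.stack(outputs, dim=1)               # (B, T_out, n_mels)
```

In the real model, decoding stops when a learned stop token fires rather than after a fixed number of frames, and teacher forcing feeds ground-truth frames during training.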

Auxiliary decoders that predict source and target phoneme sequences regularize the model; training uses multitask learning, with these auxiliary recognition losses helping the attention learn meaningful alignments. For waveform synthesis, the low-complexity Griffin-Lim algorithm is used during basic experimentation, while a WaveRNN neural vocoder produces the higher-quality audio used in evaluations.
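Griffin-Lim, the cheap synthesis path mentioned above, reconstructs a waveform from a predicted magnitude spectrogram by iteratively refining a phase estimate. A minimal sketch, assuming `scipy` STFT parameters (window length, overlap, and iteration count are arbitrary choices, not the paper's):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=50, nperseg=512, noverlap=384):
    """Reconstruct a waveform from an STFT magnitude via Griffin-Lim phase refinement.

    magnitude: (freq_bins, frames) array, e.g. |STFT| of the predicted spectrogram.
    """
    rng = np.random.default_rng(0)
    # Start from random phase and alternate between time and frequency domains.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Invert the current complex spectrogram, then re-analyze to update phase
        # while keeping the target magnitude fixed.
        _, wav = istft(magnitude * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(wav, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))
    _, wav = istft(magnitude * phase, nperseg=nperseg, noverlap=noverlap)
    return wav
```

Because phase is estimated rather than modeled, Griffin-Lim output sounds noticeably artificial, which is why a neural vocoder such as WaveRNN is used when audio quality matters.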

Experimental Evaluation

Experiments were conducted on two datasets: a proprietary conversational Spanish-English corpus and the Fisher Spanish-English corpus. Translatotron approached the performance of conventional cascaded systems, establishing the feasibility of direct translation, although its BLEU scores remained slightly below the cascaded baselines. The experiments also highlighted the importance of the auxiliary phoneme-prediction tasks in enabling effective end-to-end learning.
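The multitask objective behind the auxiliary phoneme-prediction tasks can be sketched as a weighted sum of losses. The function name, loss choices, and weights below are assumptions for illustration; the paper tunes its own auxiliary-loss configuration:

```python
import torch
import torch.nn as nn

def multitask_loss(pred_spec, tgt_spec,
                   src_phone_logits, src_phones,
                   tgt_phone_logits, tgt_phones,
                   w_src=0.1, w_tgt=0.1):
    """Main spectrogram loss plus auxiliary phoneme-recognition losses (sketch)."""
    # Primary objective: match the target spectrogram frames.
    spec_loss = nn.functional.l1_loss(pred_spec, tgt_spec)
    # Auxiliary objectives: predict source and target phoneme labels
    # from intermediate encoder states, regularizing the shared encoder.
    aux_src = nn.functional.cross_entropy(src_phone_logits, src_phones)
    aux_tgt = nn.functional.cross_entropy(tgt_phone_logits, tgt_phones)
    return spec_loss + w_src * aux_src + w_tgt * aux_tgt
```

The auxiliary decoders are used only during training and discarded at inference, so the deployed model still never produces an intermediate text representation.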

A speaker encoder was added in voice-transfer experiments to preserve source-speaker characteristics in the translated speech. Results were mixed, with performance notably lower than in comparable TTS voice-cloning tasks, indicating that speaker generalization, particularly across languages, needs further refinement.

Results

Objective evaluation using BLEU scores showed that the model can deliver intelligible and largely accurate translations, albeit with room for improvement relative to traditional cascaded methods. Subjective mean opinion score (MOS) evaluations of naturalness and speaker similarity underscored the varying synthesis quality across configurations. Translatotron's ability to reproduce the source speaker's voice highlights its potential, but also signals the need for additional strategies to improve cross-language speaker adaptation.
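BLEU, the objective metric used above, scores n-gram overlap between a candidate translation and a reference, with a brevity penalty for overly short outputs. A simplified single-sentence version (real evaluations use corpus-level tooling such as sacreBLEU; the +1 smoothing here is a common simplification, not the paper's exact scheme):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU between two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped n-gram matches: each reference n-gram counts at most once per occurrence.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        # +1 smoothing so a single zero-overlap order doesn't zero the score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty discourages short candidates that inflate precision.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Note that BLEU here is computed on ASR transcripts of the synthesized speech, so recognition errors in the evaluation pipeline lower the reported scores for all systems.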

Conclusion

The presented direct speech-to-speech model substantiates the viability of end-to-end translation systems, paving the way for reduced complexity and improved paralinguistic translation fidelity. The paper suggests exploring alternative training strategies to minimize reliance on transcription during training and proposes potential enhancements through adversarial learning and cycle consistency in voice transfer.

Future directions may involve scaling training with synthetic data or enhancing prosody transfer capabilities, offering promising avenues for further advancement in direct speech-to-speech translation methodologies.
