End-to-End Automatic Speech Translation of Audiobooks (1802.04200v1)

Published 12 Feb 2018 in cs.CL

Abstract: We investigate end-to-end speech-to-text translation on a corpus of audiobooks specifically augmented for this task. Previous works investigated the extreme case where source language transcription is not available during learning nor decoding, but we also study a midway case where source language transcription is available at training time only. In this case, a single model is trained to decode source speech into target text in a single pass. Experimental results show that it is possible to train compact and efficient end-to-end speech translation models in this setup. We also distribute the corpus and hope that our speech translation baseline on this corpus will be challenged in the future.

Summary

The paper introduces an end-to-end approach that directly translates spoken audiobook content into another language without intermediate transcription.
It employs a compact encoder-decoder model with convolutional layers, bidirectional LSTMs, and attention, and it is trained with pre-training and multi-task learning.
Experimental results demonstrate competitive performance against cascaded ASR-MT systems, highlighting the importance of data alignment and model architecture choices.

End-to-end Automatic Speech Translation of Audiobooks: A Technical Overview

The paper "End-to-end Automatic Speech Translation of Audiobooks" investigates how to translate speech in one language directly into text in another, without an intermediate transcription step. The work is set in the context of audiobooks and uses the LibriSpeech corpus, augmented to support end-to-end speech translation.

Methodology

Traditional Spoken Language Translation (SLT) systems are usually cascaded: Automatic Speech Recognition (ASR) converts speech into text, and Machine Translation (MT) then translates that text into the target language. This research instead explores an end-to-end approach, in which a single model maps source audio directly to target-language text and no intermediate transcription is needed.
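
The contrast between the two designs can be sketched as function composition. This is only an illustration: `make_cascade`, `toy_asr`, `toy_mt`, and `toy_direct` are hypothetical stand-ins, not components from the paper, and the "audio" is a plain list of floats.

```python
from typing import Callable, List

Audio = List[float]  # placeholder for a waveform or a sequence of features


def make_cascade(asr: Callable[[Audio], str],
                 mt: Callable[[str], str]) -> Callable[[Audio], str]:
    """Build a cascaded SLT system: transcribe first, then translate."""
    def translate(audio: Audio) -> str:
        transcript = asr(audio)  # intermediate source-language text
        return mt(transcript)    # any ASR error is passed on to MT
    return translate


# Toy stand-ins (assumptions for illustration only).
def toy_asr(audio: Audio) -> str:
    return "hello" if sum(audio) > 0 else "goodbye"


def toy_mt(text: str) -> str:
    return {"hello": "bonjour", "goodbye": "au revoir"}[text]


def toy_direct(audio: Audio) -> str:
    """End-to-end: one model, audio in, target text out, no transcript."""
    return "bonjour" if sum(audio) > 0 else "au revoir"


cascade = make_cascade(toy_asr, toy_mt)
```

Both `cascade` and `toy_direct` have the same signature, but only the cascade materializes a source-language transcript along the way.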

The investigation covers two scenarios:

Extreme Scenario: No source language transcription is available during training or decoding.
Midway Scenario: Transcriptions are available at the training stage, but not during decoding. This approach allows for a compact model capable of decoding source speech into target text in a single pass.
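
One way to picture the difference between the two scenarios is in terms of which training pairs each utterance can supply. The sketch below is an assumption-laden illustration: the `Utterance` record and `training_pairs` helper are hypothetical names, not part of the paper or the corpus tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Utterance:
    audio_id: str                     # reference to the source-language audio
    translation: str                  # target-language text
    transcript: Optional[str] = None  # source-language text, if available


def training_pairs(utt: Utterance) -> List[Tuple[str, str, str]]:
    """Return (task, input, output) triples this utterance can provide."""
    # Speech translation pairs exist in both scenarios.
    pairs = [("speech_translation", utt.audio_id, utt.translation)]
    if utt.transcript is not None:
        # Midway scenario: the transcript also enables ASR and MT pairs.
        pairs.append(("asr", utt.audio_id, utt.transcript))
        pairs.append(("mt", utt.transcript, utt.translation))
    return pairs


extreme = Utterance("utt-001", "bonjour")
midway = Utterance("utt-001", "bonjour", transcript="hello")
```

In both cases the decoding-time input is the audio alone; the transcript only changes what can be learned from during training.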

The Audiobook Corpus

The researchers extended the LibriSpeech dataset, which traditionally serves ASR tasks, by aligning English speech with French text to form the Augmented LibriSpeech corpus. This corpus consists of 236 hours of spoken English from the LibriSpeech dataset aligned with French text, both derived from existing translations of the books and produced by machine translation. The alignment process involved parsing public-domain literary works from LibriVox and Project Gutenberg.

Model Architecture and Training

The paper employs encoder-decoder models with attention mechanisms to perform the translation tasks. Specifically:

The speech encoder combines convolutional layers for initial feature extraction, followed by bidirectional LSTMs, resulting in a sequence of annotations to be used by the decoder.
The decoder operates at the character level, leveraging a conditional LSTM design to generate the target language output.
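
The way a decoder consults the encoder's annotations through attention can be sketched in a few lines. This is a minimal pure-Python illustration that assumes dot-product scoring for brevity; the summary above does not state which scoring function the paper uses, and `attend` is a hypothetical helper.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: Vector,
           annotations: List[Vector]) -> Tuple[List[float], Vector]:
    """Weight encoder annotations by their relevance to the decoder state."""
    # Assumption: dot-product scoring stands in for the paper's scorer.
    scores = [sum(q * k for q, k in zip(decoder_state, ann))
              for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context vector: attention-weighted sum of the annotations.
    context = [sum(w * ann[i] for w, ann in zip(weights, annotations))
               for i in range(dim)]
    return weights, context


# Three toy annotations, one per (downsampled) speech frame.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 2.0], annotations)
```

At each output character, the decoder would recompute `weights` from its current state, so different characters can draw on different stretches of the audio.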

Training combines multi-task learning with pre-training, and performance improves notably when source transcripts are used during training. In the multi-task setup, the models were trained with alternating updates across the ASR, MT, and direct speech translation tasks.
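
Switching between tasks from one update to the next can be sketched as a simple schedule. The round-robin order below is an assumption made for illustration (the exact schedule is not given here), and `alternating_schedule`, `train`, and the no-op update functions are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice
from typing import Callable, Dict, List, Sequence

TASKS = ("asr", "mt", "speech_translation")


def alternating_schedule(num_steps: int,
                         tasks: Sequence[str] = TASKS) -> List[str]:
    """Return the task to update at each step, cycling through the tasks."""
    return list(islice(cycle(tasks), num_steps))


def train(num_steps: int, update_fns: Dict[str, Callable[[], None]]) -> Counter:
    """Run one parameter update per step, switching task each time."""
    counts: Counter = Counter()
    for task in alternating_schedule(num_steps, tuple(update_fns)):
        update_fns[task]()  # in a real system: one mini-batch update
        counts[task] += 1
    return counts


# No-op updates stand in for real gradient steps (assumption).
counts = train(6, {task: (lambda: None) for task in TASKS})
```

In a real multi-task model the three update functions would touch shared parameters, which is how the ASR and MT tasks can help the speech translation task.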

Experimental Results and Implications

Experiments were run on both the synthetic BTEC corpus and the Augmented LibriSpeech corpus. A cascaded ASR-MT system generally performs best, but the proposed end-to-end models are competitive. Specifically, the results show:

Compact end-to-end models are feasible and effective, closely approximating the performance of cascaded systems.
Pre-training and multi-task learning substantially improve performance, particularly when source transcriptions are available during training.
The amount of aligned data available and the choice of model architecture strongly influence how well the learned systems perform.

Future Directions

The Augmented LibriSpeech corpus is a valuable resource for the community and invites further research on end-to-end automatic speech translation. Larger and more diverse datasets, combined with architectural innovations, could lead to more robust and efficient models that translate speech directly into text across a wider range of contexts.

In conclusion, this paper lays groundwork for end-to-end automatic speech translation of audiobooks, contributing both a training methodology and a publicly distributed dataset to support future work in speech translation.
