Direct speech-to-speech translation with discrete units

Published 12 Jul 2021 in cs.CL, cs.LG, and eess.AS | (2107.05604v2)

Abstract: We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages. Audio samples are available at https://facebookresearch.github.io/speech_translation/direct_s2st_units/index.html .

Summary

The paper presents a novel direct S2ST approach using self-supervised discrete speech representations, eliminating intermediate text generation.
The method uses HuBERT to derive discrete units from the target speech and trains a model to predict them, achieving a 6.7 BLEU improvement and lower inference cost than spectrogram-based models.
The proposed model demonstrates strong potential for unwritten languages and resource-constrained settings, paving the way for more inclusive translation technologies.

Direct Speech-to-Speech Translation With Discrete Units

The paper by Ann Lee et al. presents a direct speech-to-speech translation (S2ST) model. The model translates speech in one language into speech in another without intermediate text generation, unlike traditional cascaded systems that chain automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS). By directly predicting self-supervised discrete representations, this work is a significant step forward for spoken language translation, especially for unwritten languages.

Methodology

The core of the approach is a self-supervised discrete speech encoder that converts target speech into a sequence of discrete units. The study uses HuBERT for this step, and then trains a sequence-to-sequence speech-to-unit translation (S2UT) model to predict those units from the source speech. This contrasts with prior direct S2ST models, which predict continuous spectrogram features. The stated advantage of discrete units is that they separate linguistic content from speaker identity and prosody, which simplifies the modeling problem.
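The step from continuous features to discrete units can be sketched as a nearest-centroid lookup. This is a minimal illustration only, assuming the units are obtained by clustering frame-level encoder features (k-means is the usual choice with HuBERT); the function name `quantize_frames`, the toy 2-dimensional features, and the three centroids are hypothetical and not taken from the paper.

```python
def quantize_frames(features, centroids):
    """Map each frame vector to the index of its nearest centroid.

    The returned indices play the role of discrete units.
    Assumption: centroids were learned beforehand (e.g. by k-means).
    """
    units = []
    for frame in features:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


# Toy data: 4 frames of 2-dimensional features, 3 centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [5.2, 4.9]]

units = quantize_frames(features, centroids)
print(units)  # [0, 1, 2, 2]
```

In a real system the features would be high-dimensional frame representations from the speech encoder, and the number of centroids fixes the size of the unit vocabulary that the S2UT model predicts.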

Furthermore, when target text transcripts are available, the authors introduce a joint training framework that lets the model generate speech and text outputs in the same inference pass. This framework uses a shared encoder with partly shared decoders, and applies connectionist temporal classification (CTC) to handle the length mismatch between the speech and text outputs. On the Fisher Spanish-English dataset, the model improves by 6.7 BLEU over a baseline direct S2ST model that predicts spectrogram features. Notably, when trained without any text transcripts, the model performs comparably to spectrogram-predicting models trained with text supervision.
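To make the role of CTC concrete, the sketch below shows standard greedy CTC collapsing: a long frame-level label sequence is turned into a shorter output by merging repeated labels and dropping the blank symbol. It illustrates the general CTC rule rather than the authors' decoder; the blank index `0` and the toy label values are assumptions.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse a frame-level label sequence the way CTC does.

    Step 1: merge each run of identical consecutive labels into one.
    Step 2: remove blank labels.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Nine frames collapse to three output tokens. The blank between the
# two runs of 3 keeps them as separate tokens.
frame_labels = [0, 3, 3, 0, 3, 5, 5, 0, 0]
decoded = ctc_greedy_collapse(frame_labels)
print(decoded)  # [3, 3, 5]
```

Because the output length is determined by the collapsing rule rather than fixed in advance, a text sequence much shorter than the unit sequence can be read off the same decoder states.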

Experiments and Results

The empirical analysis uses the Fisher Spanish-English corpus with synthesized target speech, and evaluates translation quality with BLEU scores and speech naturalness with subjective mean opinion score (MOS) tests. Key experimental findings include:

Direct S2ST Advantage: The S2UT model, particularly when using reduced discrete representations, demonstrated superior performance over conventional spectrogram-targeted models across multiple metrics. This suggests potential scalability and applicability to unwritten languages where text transcripts are inherently lacking.
Computational Efficiency: The proposed model offers significant reductions in computational load and memory usage during inference. It was observed to be faster and less resource-intensive than both the direct S2ST models with spectrogram outputs and multi-stage cascaded systems.
Practical Implications: The results indicate practical utility in circumstances where computational power is constrained, enhancing the applicability of speech translation technologies in resource-scarce settings.
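The "reduced" discrete representations mentioned above can be illustrated with a short sketch, assuming that "reduced" means collapsing consecutive repeated units into a single unit. The function name `reduce_units` and the toy unit values are hypothetical; the run lengths are returned only to show what information the reduced sequence drops.

```python
from itertools import groupby


def reduce_units(units):
    """Collapse runs of identical consecutive units.

    Returns the reduced sequence and the length of each run, i.e. the
    duration information that is no longer present in the reduced form.
    """
    reduced, durations = [], []
    for unit, run in groupby(units):
        reduced.append(unit)
        durations.append(sum(1 for _ in run))
    return reduced, durations


units = [12, 12, 12, 7, 7, 45, 12, 12]
reduced, durations = reduce_units(units)
print(reduced)    # [12, 7, 45, 12]
print(durations)  # [3, 2, 1, 2]
```

A shorter target sequence means fewer decoding steps, which is one plausible reason the reduced form helps both accuracy and inference cost; the trade-off is that the speech synthesis stage then has to recover how long each unit should last.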

Implications and Future Work

This research has noteworthy implications for automated translation technologies, opening the way to unwritten and under-resourced languages. It also shows that self-supervised learning frameworks like HuBERT can supply effective prediction targets for direct speech translation.

On the theoretical side, the ability to disentangle linguistic content through discrete units marks a potential shift toward model architectures that favor end-to-end learning over cascaded frameworks.

The researchers suggest that future work with real large-scale S2ST data, rather than synthesized target speech, could further validate their findings. Moreover, integrating non-autoregressive models for both translation and synthesis could make these models more suitable for real-time use and offer further gains in efficiency.

This study lays a robust foundation for continued advancements in direct speech translation models, illustrating a compelling case for the use of discrete units in bridging linguistic divides more efficiently and inclusively.
