SimulTron: On-Device Simultaneous Speech to Speech Translation (2406.02133v1)

Published 4 Jun 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation through mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that uses the strengths of the Translatotron framework while incorporating key modifications for streaming operation and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to the previous real-time S2ST method on the MuST-C dataset. Significantly, we have successfully deployed SimulTron on a Pixel 7 Pro device, showing its potential for simultaneous on-device S2ST.

Summary

The paper introduces SimulTron, a novel on-device streaming S2ST model that significantly improves translation accuracy and latency on mobile devices.
The framework modifies the Translatotron architecture with a 16-layer causal Conformer encoder, wait-k LSTM decoder, and MelGAN vocoder for efficient real-time processing.
Experimental results on Conversational and MuST-C datasets demonstrate enhanced BLEU scores and practical deployment potential on devices like the Pixel 7 Pro.

The research paper "SimulTron: On-Device Simultaneous Speech to Speech Translation" presents an innovative framework designed to facilitate real-time, on-device speech-to-speech translation (S2ST). This paper introduces SimulTron, a model built upon the well-established Translatotron architecture. The proposed system targets the significant challenge of achieving accurate, real-time S2ST on mobile devices, a task complicated by hardware limitations such as memory and processing power constraints.

Introduction and Background

The need for efficient S2ST systems has grown with the increasing importance of global communication. While previous S2ST models have succeeded in bypassing traditional cascade approaches (separate stages for ASR, MT, and TTS), simultaneous S2ST models still struggle in on-device, real-time scenarios because of mobile hardware constraints. SimulTron aims to address these limitations by modifying the Translatotron architecture so that it is suitable for streaming operation on resource-constrained devices.

Model Architecture

The SimulTron model structure incorporates three interconnected components operating in streaming mode: a Streaming Encoder, a Streaming Decoder, and a Streaming Vocoder.

Streaming Encoder: This encoder processes incoming audio frames in real-time, employing a 16-layer causal Conformer with a 2x subsampling layer designed to maintain low latency.
Streaming Decoder: Utilizing wait-k attention, this LSTM-based decoder generates mel-spectrogram frames which are then used to iteratively output the translated speech.
Streaming Vocoder: The MelGAN vocoder facilitates the final conversion of the generated mel-spectrogram into an audio waveform, completing the real-time translation pipeline.
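The wait-k attention used by the decoder can be illustrated with a small masking sketch. This follows the generic wait-k policy (the first output step sees k source frames, and each later step sees one more), not SimulTron's actual implementation. The function names `visible_source_frames` and `wait_k_mask` are hypothetical, and the one-frame-per-step rate is an assumption made for illustration.

```python
from typing import List


def visible_source_frames(step: int, k: int, num_source: int) -> int:
    """Number of encoder frames visible at 0-based output step `step`.

    Assumption: one extra source frame becomes visible per output step,
    as in the generic wait-k policy.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if step < 0:
        raise ValueError("step must be non-negative")
    # Never expose more frames than the encoder has produced in total.
    return min(k + step, num_source)


def wait_k_mask(num_target: int, num_source: int, k: int) -> List[List[bool]]:
    """Boolean attention mask: mask[t][j] is True if step t may attend to frame j."""
    return [
        [j < visible_source_frames(t, k, num_source) for j in range(num_source)]
        for t in range(num_target)
    ]


if __name__ == "__main__":
    # Print the mask row by row: "x" = may attend, "." = masked out.
    for row in wait_k_mask(num_target=3, num_source=5, k=2):
        print("".join("x" if allowed else "." for allowed in row))
```

A larger k exposes more source context to every output step, at the cost of a longer wait before the first output.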

The model structure and streaming capabilities allow SimulTron to initiate translation with an adjustable delay, balancing context and latency.
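As a rough illustration of a fixed, adjustable delay, the sketch below buffers incoming frames and starts emitting output only once k frames have arrived, then emits one output per new frame. It is a toy loop under stated assumptions, not the paper's code: `stream_with_fixed_delay` and `decode_step` are hypothetical names, and a real streaming model would carry recurrent state between steps rather than re-reading a growing list.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")
Out = TypeVar("Out")


def stream_with_fixed_delay(
    frames: Iterable[Frame],
    k: int,
    decode_step: Callable[[List[Frame]], Out],
) -> Iterator[Out]:
    """Yield one output per input frame once at least k frames have arrived.

    `decode_step` is a stand-in for the decoder; it receives every frame
    seen so far. Outputs produced after the input ends are not modeled.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    seen: List[Frame] = []
    for frame in frames:
        seen.append(frame)
        # Stay silent until the initial delay of k frames has elapsed.
        if len(seen) >= k:
            yield decode_step(seen)


if __name__ == "__main__":
    # With k=3, nothing is emitted for the first two frames.
    for out in stream_with_fixed_delay(range(5), k=3, decode_step=len):
        print(out)
```

Raising k delays the first output but gives each decoding step more context, which matches the context-versus-latency trade-off described above.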

Experiments and Results

Experimental validation involved two primary datasets: the Conversational Dataset and the MuST-C Dataset.

Conversational Dataset:
- SimulTron achieved higher BLEU scores than Translatotron 1, reaching 51.2 at k=150 while maintaining real-time operation. The results showed a clear trade-off between the delay parameter (k) and translation quality. In offline evaluation, SimulTron achieved a BLEU score of 57.4, an improvement over both Translatotron 1 and 2.
- Mean Opinion Scores (MOS) decreased at lower k values, indicating that output quality degrades when the model has less input context.

MuST-C Dataset:
- SimulTron surpassed the iTTS method, achieving a BLEU score of 14.7 at k=150 along with lower latency.

SimulTron outperformed prior real-time methods in both translation accuracy and latency. The experiments also demonstrated that SimulTron can be deployed on mobile hardware, as evaluated on a Pixel 7 Pro device.

Implications and Future Directions

SimulTron sets a precedent for achieving efficient, on-device, real-time S2ST. Its ability to function within the limited computational resources of mobile devices without compromising translation quality is significant. From a practical perspective, this advancement augments the accessibility and usability of S2ST technology, supporting linguistically diverse and technology-reliant communities.

Theoretically, the success of the causal conformer encoder and wait-k attention mechanisms in practical applications may inspire further refinement of streaming models. Future directions for this research could include the extension of SimulTron's capabilities to support a broader array of languages and further enhancements to accommodate diverse and acoustically challenging environments.

In conclusion, the research encapsulated in SimulTron represents a pivotal step in the evolution of mobile S2ST systems. By demonstrating the potential for accurate, real-time, on-device translations, this model paves the way for more inclusive and effective communication technologies.
