End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification (1811.04719v1)

Published 12 Nov 2018 in cs.CL

Abstract: Autoregressive decoding is the only part of sequence-to-sequence models that prevents them from massive parallelization at inference time. Non-autoregressive models enable the decoder to generate all output symbols independently in parallel. We present a novel non-autoregressive architecture based on connectionist temporal classification and evaluate it on the task of neural machine translation. Unlike other non-autoregressive methods which operate in several steps, our model can be trained end-to-end. We conduct experiments on the WMT English-Romanian and English-German datasets. Our models achieve a significant speedup over the autoregressive models, keeping the translation quality comparable to other non-autoregressive models.

Summary

The paper proposes an end-to-end non-autoregressive NMT model using Connectionist Temporal Classification (CTC) to overcome the sequential decoding limitations of autoregressive models.
By reframing translation as a sequence labeling problem with a modified Transformer decoder, the model achieves around 4x speedup and 80-90% of autoregressive BLEU scores on WMT datasets.
This approach enables significantly faster translation inference suitable for real-time services and lays groundwork for future quality enhancements via techniques like iterative denoising or external language models.

End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification

The paper by Libovický and Helcl presents a non-autoregressive neural machine translation (NMT) model based on Connectionist Temporal Classification (CTC). The motivation is a computational limitation of autoregressive models: each output symbol depends on the previously decoded ones, so decoding must run sequentially, which prevents parallelization and increases inference time.

Theoretical and Methodological Insights

The paper introduces a non-autoregressive NMT framework that is trained end-to-end with CTC. Autoregressive NMT models compute the probability of each output symbol conditioned on the previously decoded symbols, which forces serial processing. In contrast, the proposed model generates all output symbols in parallel, which makes inference considerably faster. It achieves this by reframing translation as a sequence labeling problem rather than sequence prediction: each position in the state sequence is labeled with either an output token or a blank symbol, and the CTC loss marginalizes over all labelings that collapse to the target sentence.
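To make the sequence labeling view concrete, the following is a minimal sketch of greedy CTC decoding: every position picks its best label independently, then consecutive repeats are merged and blanks are dropped. The toy scores, the `<blank>` symbol name, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def greedy_label(position_scores):
    """Pick the highest-scoring label at each position, independently of all others."""
    return [max(scores, key=scores.get) for scores in position_scores]


def ctc_collapse(labels, blank=BLANK):
    """Standard CTC collapsing rule: merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(labels)]
    return [label for label in merged if label != blank]


# Toy per-position scores (made-up numbers, one dict per output position).
position_scores = [
    {"a": 0.7, "cat": 0.1, BLANK: 0.2},
    {"a": 0.6, "cat": 0.1, BLANK: 0.3},
    {"a": 0.1, "cat": 0.2, BLANK: 0.7},
    {"a": 0.1, "cat": 0.8, BLANK: 0.1},
]

labels = greedy_label(position_scores)  # ['a', 'a', '<blank>', 'cat']
print(ctc_collapse(labels))             # ['a', 'cat']
```

Because no position reads another position's decision, the labeling step can run for all positions at once. A blank between two identical labels keeps them distinct, so `["a", "<blank>", "a"]` collapses to two tokens rather than one.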

The architecture is a modified Transformer. The encoder stays close to the conventional Transformer, but the decoder does not consume its own previous outputs, so the temporal mask in its self-attention is omitted. As a result, all positions are computed in parallel and the number of sequential decoding steps does not grow with the output length. Because CTC needs at least as many label positions as there are target tokens, the model lengthens the state sequence: each encoder state is split into several states, with the number set by a split factor, so that the output can be longer than the input.
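The state lengthening can be sketched as follows, assuming it is implemented as a linear projection of each d-dimensional state to k·d dimensions followed by cutting the result into k states of dimension d. The function name, the toy weight matrix, and the dimensions are illustrative assumptions, not the authors' code.

```python
def split_states(states, weight, k):
    """Project each d-dim state to k*d dims, then cut it into k d-dim states.

    states: list of T vectors of length d
    weight: d x (k*d) projection matrix (learned in a real model, a toy here)
    returns: list of k*T vectors of length d
    """
    d = len(states[0])
    expanded = []
    for h in states:
        projected = [sum(h[i] * weight[i][j] for i in range(d)) for j in range(k * d)]
        expanded.extend(projected[c * d:(c + 1) * d] for c in range(k))
    return expanded


d, k = 2, 3
# Toy weight that copies the input into each of the k chunks (assumption for the demo).
weight = [[1.0 if j % d == i else 0.0 for j in range(k * d)] for i in range(d)]
states = [[1.0, 2.0], [3.0, 4.0]]  # T = 2 source positions

expanded = split_states(states, weight, k)
print(len(expanded))  # 6
```

With T = 2 source positions and k = 3, the labeler sees 6 positions, so after CTC collapsing it can emit up to 6 target tokens.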

Experimental Setup and Results

The authors evaluated the model on the WMT English-Romanian and English-German datasets. The results show that the proposed model is significantly faster than autoregressive counterparts while keeping translation quality comparable to other non-autoregressive methods. The reported speedup is about 4x, which is smaller than the speedups reported in some previous work, possibly because of differences in implementation overhead.

Quantitatively, the model reaches around 80-90% of the BLEU scores of autoregressive models. Three architectural variants were tested: deep encoder, encoder-decoder, and encoder-decoder with positional encoding. The encoder-decoder variant often outperformed the deep encoder, which suggests that the additional architectural complexity helps even at a similar computational cost.

Implications and Future Directions

The implications of this research are notable for both practical applications and theoretical explorations in machine translation. The reduction in inference time without severely compromising translation quality suggests that non-autoregressive models could be leveraged in real-time translation services where latency is critical.

Future work could improve translation quality with iterative denoising, a technique drawn from prior non-autoregressive research, while retaining the benefit of non-autoregressive inference. Incorporating an external language model in a beam search framework is another promising avenue, in line with practice in other sequence prediction domains such as speech recognition.

The paper is thus a step toward more efficient neural machine translation systems and may inform further work on model architectures for large-scale and real-time applications.