Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation (2006.10369v4)

Published 18 Jun 2020 in cs.CL

Abstract: Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work, we reexamine this tradeoff and argue that autoregressive baselines can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed. We show that the speed disadvantage for autoregressive baselines compared to non-autoregressive methods has been overestimated in three aspects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation. Our results establish a new protocol for future research toward fast, accurate machine translation. Our code is available at https://github.com/jungokasai/deep-shallow.

Summary

The paper introduces a deep encoder and shallow decoder strategy that achieves comparable translation accuracy to standard methods with significant speed gains.
Extensive experiments on WMT datasets demonstrate that optimizing encoder-decoder depth improves word order capture and overall computational efficiency.
The findings suggest a paradigm shift towards enhancing autoregressive models, offering efficient and high-quality translation for real-world applications.

Deep Encoder, Shallow Decoder: Reevaluating Non-Autoregressive Machine Translation

The paper examines neural machine translation (NMT), comparing autoregressive (AR) and non-autoregressive (NAR) models. The authors critically reexamine the tradeoff between translation speed and quality in these models, especially when they run on modern GPUs. They challenge assumptions common in the NAR literature and propose speeding up autoregressive models with a deep encoder and a shallow decoder.

Background and Main Hypothesis

In neural machine translation, autoregressive models generate translations sequentially, producing each token conditioned on the tokens generated so far. This approach is accurate but can be slow, because each decoding step must wait for the previous one. NAR models instead generate target tokens in parallel, which promises speed gains by removing the dependency on preceding tokens. Historically, however, NAR models have struggled with translation quality because they model the dependencies between target words inadequately.

The cornerstone of this paper is the hypothesis that AR models, traditionally at a speed disadvantage compared to NAR models, can be significantly accelerated by optimizing the depth distribution between encoders and decoders. Specifically, the authors propose that employing a deeply layered encoder paired with a shallow decoder can mitigate speed issues without compromising translation quality.
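The intuition behind this hypothesis can be illustrated with a toy latency proxy. The sketch below is not from the paper: it simply counts how many layer applications must run one after another, assuming the encoder processes the whole source in parallel while an AR decoder runs its full stack once per generated token. The function name and the target length of 25 are illustrative assumptions, and the count ignores layer width, attention cost, and caching details.

```python
def sequential_layer_passes(encoder_layers: int, decoder_layers: int, target_len: int) -> int:
    """Toy latency proxy: layer applications that must run one after another.

    Each encoder layer counts once, because the encoder sees all source
    tokens at the same time. Each decoder layer counts target_len times,
    because an autoregressive decoder runs once per generated token.
    """
    return encoder_layers + decoder_layers * target_len


if __name__ == "__main__":
    target_len = 25  # assumed sentence length, for illustration only
    for enc, dec in [(6, 6), (12, 1)]:
        passes = sequential_layer_passes(enc, dec, target_len)
        print(f"{enc}-{dec}: {passes} sequential layer passes")
```

Under these assumptions a 6-6 model needs 156 sequential layer passes and a 12-1 model needs 37, which is why moving depth from the decoder to the encoder can plausibly cut single-sentence latency even though the total number of layers grows.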

Methodology and Experiments

The authors conduct extensive experiments addressing three factors that have caused the speed disadvantage of AR baselines to be overestimated:

Suboptimal layer allocation between encoder and decoder.
Inadequate speed measurement practices.
Lack of knowledge distillation for AR baselines.

They benchmark various configurations on WMT14, WMT16, and WMT17 datasets across several language pairs, including English-German and English-Chinese. A particular point of investigation is knowledge distillation, which is routinely applied to NAR models but often omitted from the AR baselines they are compared against, even though AR models benefit from it as well.
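As a rough illustration of how distillation data is typically prepared, the sketch below assumes the sequence-level variant, in which the reference translations in the training set are replaced by a teacher model's outputs. The summary above does not specify the variant, so this is an assumption; `build_distilled_corpus` and `teacher_translate` are hypothetical names, and the toy teacher merely uppercases its input.

```python
from typing import Callable, List, Tuple


def build_distilled_corpus(
    sources: List[str],
    teacher_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each source sentence with the teacher's translation.

    The student (AR or NAR) is then trained on these pairs instead of
    the original human references.
    """
    return [(src, teacher_translate(src)) for src in sources]


if __name__ == "__main__":
    # Stand-in teacher for illustration only; a real teacher would be a
    # trained translation model.
    def toy_teacher(sentence: str) -> str:
        return sentence.upper()

    for src, tgt in build_distilled_corpus(["guten morgen", "danke"], toy_teacher):
        print(src, "->", tgt)
```

Applying the same distilled corpus to both AR and NAR students is what makes the comparison even-handed.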

AR models are tested with varied encoder and decoder depths to show the strengths of the deep-encoder, shallow-decoder configuration. Speed is measured in two settings: translating one sentence at a time (S1) and translating with the maximum batch size (Smax), reflecting interactive use and bulk processing, respectively.
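The two settings can be approximated with a small timing harness. The sketch below is an assumption-laden illustration, not the authors' benchmarking code: `translate_batch` is a hypothetical callable standing in for a real model, and throughput is reported in sentences per second. Calling it with a batch size of 1 corresponds to the S1 setting; calling it with the largest batch size the hardware can hold corresponds to Smax.

```python
import time
from typing import Callable, List, Sequence


def sentences_per_second(
    translate_batch: Callable[[Sequence[str]], List[str]],
    sentences: Sequence[str],
    batch_size: int,
) -> float:
    """Translate all sentences in chunks of batch_size and return throughput."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    start = time.perf_counter()
    translated = 0
    for i in range(0, len(sentences), batch_size):
        outputs = translate_batch(sentences[i:i + batch_size])
        translated += len(outputs)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast toy workloads.
    return translated / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in "model" for illustration only.
    def toy_translate(batch: Sequence[str]) -> List[str]:
        return [s[::-1] for s in batch]

    data = ["ein satz"] * 64
    print("S1-style  :", sentences_per_second(toy_translate, data, batch_size=1))
    print("Smax-style:", sentences_per_second(toy_translate, data, batch_size=64))
```

On a real GPU, timings should also account for warm-up runs and device synchronization, which this sketch omits.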

Findings

Key findings from the research are as follows:

Autoregressive models with a 12-layer encoder and a single-layer decoder achieve translation accuracy comparable to the standard configuration with a 6-layer encoder and a 6-layer decoder, while running substantially faster. The gain is particularly evident in the S1 setting.
NAR models, despite their speed advantage in the S1 setting, often fall behind AR models in large-batch (Smax) scenarios because iterative refinement adds computational cost.
Knowledge distillation improves both AR and NAR models, and AR models retain a larger accuracy advantage when distillation is applied to both.
Shifting depth into the encoder not only improves speed but also lets AR models continue to capture word-order differences across languages, with less dependence on costly decoder operations.
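The large-batch finding can be made concrete with a second toy proxy that counts total work rather than sequential depth. This sketch is not from the paper: it assumes each token passes through each layer once per decoding pass, that an AR decoder makes one pass per target position, and that an iterative NAR decoder re-runs its full stack over all positions at every refinement iteration. The function name, the sentence lengths, and the choice of 10 iterations are illustrative assumptions; attention cost and length prediction are ignored.

```python
def total_token_layer_ops(
    encoder_layers: int,
    decoder_layers: int,
    source_len: int,
    target_len: int,
    refinement_iterations: int = 1,
) -> int:
    """Toy throughput proxy: total (token, layer) applications per sentence.

    refinement_iterations=1 models an AR decoder, where each target position
    passes through the decoder stack once. Values above 1 model an iterative
    NAR decoder that re-decodes every position at each iteration.
    """
    encoder_work = encoder_layers * source_len
    decoder_work = decoder_layers * target_len * refinement_iterations
    return encoder_work + decoder_work


if __name__ == "__main__":
    src_len = tgt_len = 25  # assumed lengths, for illustration only
    print("AR 6-6          :", total_token_layer_ops(6, 6, src_len, tgt_len))
    print("AR 12-1         :", total_token_layer_ops(12, 1, src_len, tgt_len))
    print("NAR 6-6, 10 iter:", total_token_layer_ops(6, 6, src_len, tgt_len, refinement_iterations=10))
```

When the batch is large enough to saturate the GPU, total work matters more than sequential depth, so a decoder that is re-run many times per sentence loses much of its parallelism advantage under this simplified view.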

Practical and Theoretical Implications

The results suggest a shift in how NMT systems are designed: there is substantial potential in optimizing autoregressive models rather than only seeking alternative decoding schemes in NAR models. Practically, this could yield more efficient machine translation systems for real-world settings that require both high throughput and high translation quality.

Theoretically, the paper opens a line of inquiry into how computational resources should be balanced between encoder and decoder components in sequence transduction tasks. Future research could explore finer-grained configurations and extend the approach to domains beyond machine translation, such as sequence prediction and automated summarization with large pre-trained language models.

Overall, the paper argues that the assumed speed disadvantage of autoregressive models deserves reevaluation, and it offers a viable path toward efficient, high-quality neural machine translation.