Learning Deep Transformer Models for Machine Translation

Published 5 Jun 2019 in cs.CL and cs.LG | (1906.01787v1)

Abstract: Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT'16 English-German, NIST OpenMT'12 Chinese-English and larger WMT'18 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.


Authors (7)

Citations (618)


Summary

The paper introduces an enhanced Transformer model using deeper encoder networks with a pre-norm configuration to improve training stability.
It employs a dynamic linear combination of layer outputs to effectively integrate contextual information across multiple layers.
Empirical results on WMT'16 English-German, NIST OpenMT'12 Chinese-English, and WMT'18 Chinese-English show gains of 0.4-2.4 BLEU over 6-layer baselines, with a model that is 1.6X smaller and 3X faster to train than Transformer-Big.

Deep Transformer Models for Machine Translation

The paper "Learning Deep Transformer Models for Machine Translation" examines whether deeper encoders can improve Transformer-based neural machine translation (NMT). Whereas most prior work widens the network (Transformer-Big), this paper increases encoder depth and tackles the difficulty of training deep networks.

The authors introduce refinements to the Transformer model that allow it to support significantly deeper encoders. The method rests on two changes: repositioning layer normalization and adding a dynamic linear combination of layer outputs. Both aim to ease the optimization problems, such as vanishing or exploding gradients, that typically arise when training deep networks.

Key Innovations

Layer Normalization: The paper distinguishes between post-norm and pre-norm placements of layer normalization within the Transformer architecture. Post-norm, the original Transformer design, normalizes after the residual addition: x_{l+1} = LN(x_l + F(x_l)). Pre-norm normalizes the input of each sublayer and leaves the residual path as an identity: x_{l+1} = x_l + F(LN(x_l)). With the identity path intact, gradients can flow directly from the top layers to the bottom ones, which makes deep stacks easier to optimize.
Dynamic Linear Combination of Layers (DLCL): Inspired by the linear multi-step method in numerical analysis, DLCL feeds each layer a learned weighted combination of the outputs of all preceding layers, rather than only the output of the layer directly below. Every layer therefore has direct access to earlier representations, whereas a standard residual connection reaches them only through the immediately preceding layer.
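
The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: vectors are Python lists, layer normalization has no learned gain or bias, the sublayer `toy_sublayer` is a hypothetical stand-in for attention or feed-forward blocks, and the DLCL weights are fixed to a uniform average (in the actual model they are learned). All function names are invented for this sketch.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [u + v for u, v in zip(a, b)]


def post_norm_step(x, sublayer):
    # Post-norm: x_{l+1} = LN(x_l + F(x_l)); the residual sum is normalized.
    return layer_norm(add(x, sublayer(x)))


def pre_norm_step(x, sublayer):
    # Pre-norm: x_{l+1} = x_l + F(LN(x_l)); the residual path stays an identity.
    return add(x, sublayer(layer_norm(x)))


def dlcl_input(prev_outputs, weights):
    """Weighted sum of the normalized outputs of all earlier layers."""
    combined = [0.0] * len(prev_outputs[0])
    for w, y in zip(weights, prev_outputs):
        combined = [c + w * n for c, n in zip(combined, layer_norm(y))]
    return combined


def uniform_weights(num_layers):
    # Assumption for this sketch: layer l averages its l + 1 predecessors.
    # In the paper these weights are trainable parameters.
    return [[1.0 / (l + 1)] * (l + 1) for l in range(num_layers)]


def deep_encoder(x, sublayers, weight_rows):
    """Run a stack in which each layer reads a mix of all earlier outputs."""
    outputs = [x]  # y_0: the embedding
    for sublayer, row in zip(sublayers, weight_rows):
        layer_in = dlcl_input(outputs, row)
        # Simplified layer body: residual connection around a toy sublayer.
        outputs.append(add(layer_in, sublayer(layer_in)))
    return outputs


def toy_sublayer(x):
    # Hypothetical stand-in for attention / feed-forward: scale by 0.5.
    return [0.5 * v for v in x]


if __name__ == "__main__":
    x0 = [1.0, 2.0, 4.0]
    print("post-norm:", post_norm_step(x0, toy_sublayer))
    print("pre-norm: ", pre_norm_step(x0, toy_sublayer))
    ys = deep_encoder(x0, [toy_sublayer] * 4, uniform_weights(4))
    print("layers computed:", len(ys) - 1)
```

In `pre_norm_step` the input `x` reaches the output unchanged through the addition, which is the identity path that helps gradients in deep stacks. In `deep_encoder`, row `l` of the weight table has `l + 1` entries, one per earlier output, so the table is lower triangular.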

Empirical Results

The authors report results on WMT'16 English-German, NIST OpenMT'12 Chinese-English, and the larger WMT'18 Chinese-English task. Deep systems with 30- or 25-layer encoders outperform the shallow 6-layer Transformer-Base and Transformer-Big baselines by 0.4 to 2.4 BLEU points. The deep model is also 1.6 times smaller and three times faster to train than Transformer-Big.

Implications and Future Work

The work challenges the prevailing focus on model width by showing that depth is effective once known training difficulties are addressed. The results suggest that BLEU gains over Transformer-Big can be achieved with a smaller model, which has important implications for deploying NMT systems in resource-constrained environments.

The approach may also encourage work on other deep architectures and on applications beyond machine translation, including language modeling and other areas where large Transformers are already used. It leaves open how dynamic layer combinations behave in even deeper networks and in other neural network architectures.

This research presents a noteworthy step in NMT architecture design, underscoring the importance of depth and optimization in leveraging the full potential of the Transformer model.
