Attention Is All You Need (1706.03762v7)
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
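The abstract's central claim is an architecture built entirely from attention. As a rough illustration of the core operation it refers to, scaled dot-product attention, here is a minimal NumPy sketch. The softmax(QKᵀ/√d_k)V formulation comes from the paper itself; the function name, toy dimensions, and the self-attention usage below are illustrative choices for this sketch, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Pairwise query-key similarities, scaled by sqrt(d_k) as in the paper.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns each query's scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is an attention-weighted sum of the values.
    return weights @ V

# Toy usage: 4 positions, width 8 (illustrative sizes, not the paper's).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```

In the full model this operation is applied in parallel over several learned projections of the input (multi-head attention) and combined with position-wise feed-forward layers; the sketch shows only the single-head core.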