Direct Neural Machine Translation with Task-level Mixture of Experts models (2310.12236v2)
Abstract: Direct neural machine translation (direct NMT) is a type of NMT system that translates text between two non-English languages. Direct NMT systems often face limitations due to the scarcity of parallel data between non-English language pairs. Several approaches have been proposed to address this limitation, such as multilingual NMT and pivot NMT (translation between two languages via English). Task-level Mixture-of-Experts models (Task-level MoE), an inference-efficient variation of Transformer-based models, have shown promising NMT performance for a large number of language pairs. In Task-level MoE, different language groups can use different routing strategies to optimize cross-lingual learning and inference speed. In this work, we examine Task-level MoE's applicability to direct NMT and propose a series of high-performing training and evaluation configurations, through which Task-level MoE-based direct NMT systems outperform bilingual and pivot-based models for a large number of low- and high-resource direct pairs and translation directions. Our Task-level MoE with 16 experts outperforms both bilingual and pivot NMT models for 7 language pairs, while pivot-based models still perform better for 9 pairs and directions.
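To make the routing idea concrete, below is a minimal sketch (not the paper's implementation) of a task-level MoE feed-forward sublayer in PyTorch. Unlike token-level MoE, where a learned gate picks experts per token, here the expert is selected deterministically from the task identity (the language pair or language group), so a whole batch for one direction touches exactly one expert. The `TASK_TO_EXPERT` mapping and all names below are illustrative assumptions, not the paper's actual grouping.

```python
import torch
import torch.nn as nn

# Hypothetical task-to-expert mapping: each direct language pair (or group of
# related pairs) is assigned to one expert. The grouping here is illustrative,
# not the routing strategy used in the paper.
TASK_TO_EXPERT = {"fr-de": 0, "hi-bn": 1, "ru-uk": 1, "ja-ko": 2}


class TaskLevelMoELayer(nn.Module):
    """Feed-forward sublayer whose expert is chosen per task (language pair),
    not per token, so each translation direction uses a fixed expert."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        # Deterministic, task-level routing: the entire batch for one language
        # pair flows through a single expert, with no per-token gating network.
        expert = self.experts[TASK_TO_EXPERT[task]]
        return expert(x)


# Usage: a batch of 8 sentences, 32 tokens each, model width 512,
# with 16 experts as in the paper's best-performing configuration.
layer = TaskLevelMoELayer(d_model=512, d_ff=2048, num_experts=16)
out = layer(torch.randn(8, 32, 512), task="fr-de")
print(out.shape)  # torch.Size([8, 32, 512])
```

Because routing is fixed per task, inference for a given direction only ever needs the selected expert's weights in memory, which is what makes this variant inference-efficient relative to token-level sparse MoE.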