Cross-lingual Language Model Pretraining

Published 22 Jan 2019 in cs.CL | (1901.07291v1)

Abstract: Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.

Summary

The paper introduces a novel cross-lingual pretraining approach that combines unsupervised (MLM) and supervised (TLM) objectives to enhance multilingual understanding.
It employs a shared BPE sub-word vocabulary to improve language alignment and boost benchmark scores on tasks like XNLI and WMT translation.
Empirical results show significant gains, including improvements of more than 9 BLEU on unsupervised and more than 4 BLEU on supervised machine translation, as well as reduced perplexity for low-resource language modeling.

Overview

The paper "Cross-lingual Language Model Pretraining" by Guillaume Lample and Alexis Conneau extends language model pretraining from English to multiple languages. It introduces two ways of learning cross-lingual language models (XLMs), one unsupervised and one supervised, and shows that both help on a range of natural language understanding and generation tasks. The work advances general-purpose multilingual models, reduces reliance on large parallel corpora, and improves performance on low-resource languages.

Methodology

The authors propose two primary pretraining strategies:

Unsupervised Cross-lingual Learning: Uses only monolingual corpora, through two objectives:
- Causal Language Modeling (CLM): Predicting the next token given the preceding tokens.
- Masked Language Modeling (MLM): Predicting randomly masked tokens from their surrounding context, similar to BERT's approach.
Supervised Cross-lingual Learning: Uses parallel data through the Translation Language Modeling (TLM) objective, which concatenates a sentence with its translation and masks tokens in both. The model can draw on context from either language to recover a masked token, which encourages it to align representations across languages.
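
The MLM and TLM objectives described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 15% masking rate and the 80/10/10 split (mask symbol, random token, unchanged) follow the BERT recipe and are assumptions as far as this summary is concerned, as are the resetting of positions for the second sentence and the per-token language ids. The names `MASK`, `SEP`, `mask_tokens`, and `build_tlm_example` are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol
SEP = "</s>"     # hypothetical sentence separator


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """BERT-style corruption of a token list.

    Returns (corrupted, targets). targets[i] holds the original token at
    every position selected for prediction and None everywhere else.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)
            roll = rng.random()
            if roll < 0.8:            # 80%: replace with the mask symbol
                corrupted.append(MASK)
            elif roll < 0.9:          # 10%: replace with a random token
                corrupted.append(rng.choice(vocab))
            else:                     # 10%: keep the token unchanged
                corrupted.append(tok)
        else:
            targets.append(None)
            corrupted.append(tok)
    return corrupted, targets


def build_tlm_example(src, tgt, src_lang, tgt_lang, vocab, rng, mask_prob=0.15):
    """Concatenate a sentence pair and mask tokens on both sides.

    Assumption: positions restart at 0 for the target sentence, and every
    token carries a language id, so the model can tell the two halves apart.
    """
    src_in, src_targets = mask_tokens(src, vocab, rng, mask_prob)
    tgt_in, tgt_targets = mask_tokens(tgt, vocab, rng, mask_prob)
    return {
        "tokens": src_in + [SEP] + tgt_in,
        "targets": src_targets + [None] + tgt_targets,  # separator is never predicted
        "positions": list(range(len(src) + 1)) + list(range(len(tgt))),
        "langs": [src_lang] * (len(src) + 1) + [tgt_lang] * len(tgt),
    }


rng = random.Random(0)
vocab = ["the", "cat", "sleeps", "le", "chat", "dort"]
example = build_tlm_example(
    ["the", "cat", "sleeps"], ["le", "chat", "dort"], "en", "fr", vocab, rng
)
```

Plain MLM corresponds to calling `mask_tokens` on a monolingual stream; TLM differs only in that the input is a concatenated translation pair, so a token masked in one language can be recovered from the other.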

All languages share a single sub-word vocabulary built with Byte Pair Encoding (BPE). Sharing the vocabulary improves cross-lingual alignment, and the effect is strongest for languages that have a script or tokens in common.
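
To make the idea of a shared vocabulary concrete, here is a toy BPE learner run on the concatenation of word lists from two languages. It is a sketch of the general BPE algorithm under simplifying assumptions (word-level input, a tiny made-up corpus, a hypothetical `</w>` end-of-word marker), not the tooling the authors used; `learn_bpe`, `merge_pair`, and `segment` are illustrative names.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from an iterable of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(word, merges):
    """Split a word into sub-word units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up mini corpora; the merges are learned on both languages jointly.
english = ["nation", "national", "station"]
french = ["nation", "nationale", "station"]
merges = learn_bpe(english + french, num_merges=10)
print(segment("nationale", merges))
```

Because the merges are learned jointly, strings that occur in both languages end up as the same sub-word units, which gives the model common anchor points across languages.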

Experiments and Results

The paper evaluates XLMs on several tasks and finds that cross-lingual pretraining substantially benefits both natural language understanding and machine translation. Notable results include:

Cross-lingual Classification: On the XNLI benchmark, the unsupervised MLM model reached 71.5% accuracy, surpassing the previous state of the art. The supervised MLM+TLM model reached 75.1% accuracy, an absolute gain of 4.9% over the previous best and a clear benefit from parallel data. Fine-tuning on a training set in each language (the translate-train setting) pushed this further to 76.7%.
Unsupervised Machine Translation: On WMT'14 English-French, WMT'16 English-German, and WMT'16 English-Romanian, XLM pretraining improved BLEU scores. The best model reached 34.3 BLEU on German-English, more than 9 BLEU above the previous state of the art.
Supervised Machine Translation: On WMT'16 Romanian-English, pretraining gave a new state-of-the-art score of 38.5 BLEU, outperforming the previous best approach by more than 4 BLEU.
Low-resource Language Modeling: Cross-lingual training significantly reduced the perplexity of a Nepali language model, particularly when Nepali data was combined with Hindi and English data.
Unsupervised Cross-lingual Word Embeddings: XLM-derived embeddings showed superior performance in semantic similarity tasks and translation closeness metrics.
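
For readers less familiar with the perplexity metric cited in the low-resource language modeling result, the sketch below shows the standard definition: the exponential of the average negative log-likelihood per token. The function name `perplexity` and the example values are illustrative and are not taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities the model assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# A model that assigns probability 0.25 to every token is, on average,
# as uncertain as a uniform choice among four tokens.
uniform_over_four = [math.log(0.25)] * 3
print(perplexity(uniform_over_four))
```

Lower is better: a drop in perplexity means the model assigns higher probability to held-out text.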

Implications and Future Work

The cross-lingual pretraining methods introduced in this paper yield consistent improvements across several NLP tasks, which suggests that XLMs can transfer knowledge across language boundaries. The results are especially relevant for low-resource languages and cross-lingual applications, where better pretrained models could help narrow the gap in language technology coverage.

Looking ahead, tuning the balance between the unsupervised and supervised objectives could further improve XLM performance, and extending cross-lingual transfer to more diverse languages and domains may yield additional gains. The authors state that their code and pretrained models will be made publicly available, which should make it easier for others to build on this work.

In conclusion, this paper is a significant step for multilingual NLP and provides a foundation on which later work can build.
