The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English (1902.01382v3)

Published 4 Feb 2019 in cs.CL

Abstract: For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLoRes evaluation datasets for Nepali-English and Sinhala-English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.

Summary

The paper introduces high-quality FLoRes evaluation datasets addressing the scarcity of parallel data in low-resource MT.
It details a rigorous data construction process using document selection, automatic filtering, and manual quality checks to ensure translation adequacy and fluency.
Experimental results show that semi-supervised techniques with back-translation and multilingual data significantly improve BLEU scores despite domain drift.

The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

The paper presents the FLoRes evaluation datasets for the Nepali-English and Sinhala-English language pairs, both of which are low-resource because little parallel data exists for them. The datasets aim to address two challenges in low-resource machine translation (MT): insufficient training data and the absence of reliable evaluation benchmarks.

Dataset Construction

The datasets are derived from Wikipedia articles and consist of professionally translated sentences. The paper describes the data collection pipeline in detail: document selection, automatic filtering, and manual quality checks. The process targets both adequacy and fluency, and only translations with an average quality score above 70 are retained.
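The retention rule can be illustrated with a minimal sketch. The threshold of 70 comes from the text above; everything else is an assumption made for illustration, including the `RatedTranslation` record, the per-rater `scores` field, and the choice to average over raters. The summary does not state the rating scale or whether scores are aggregated per sentence or per document.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

# Threshold taken from the summary; the rating scale itself is an assumption.
MIN_AVERAGE_SCORE = 70.0


@dataclass
class RatedTranslation:
    """Hypothetical record: one translated item plus its human quality scores."""
    source: str
    translation: str
    scores: List[float]


def retain_high_quality(items: List[RatedTranslation],
                        threshold: float = MIN_AVERAGE_SCORE) -> List[RatedTranslation]:
    """Keep items whose mean score is strictly above the threshold.

    Items with no scores are dropped, since their quality is unknown.
    """
    return [item for item in items if item.scores and mean(item.scores) > threshold]


candidates = [
    RatedTranslation("source sentence 1", "translation 1", [80.0, 90.0]),
    RatedTranslation("source sentence 2", "translation 2", [60.0, 75.0]),
    RatedTranslation("source sentence 3", "translation 3", []),
]
kept = retain_high_quality(candidates)
```

In this toy input only the first item survives: the second averages 67.5 and the third has no ratings.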

Learning Settings and Methodologies

The paper evaluates four learning settings (fully supervised, weakly supervised, semi-supervised, and fully unsupervised), drawing on both existing parallel data and monolingual sources. Baseline experiments show the limitations of state-of-the-art methods on these language pairs, as reflected in low BLEU scores. Supervised and semi-supervised models outperform unsupervised ones, which struggle because the non-comparable monolingual corpora yield poor word embedding initialization.
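Since results are reported in BLEU, a simplified sketch of the metric may help when reading the numbers. This is not the paper's evaluation script: it assumes whitespace tokenization, a single reference per sentence, uniform weights over 1- to 4-grams, and no smoothing. The function names `ngrams` and `corpus_bleu` are chosen here for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


perfect = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
disjoint = corpus_bleu(["a b c d e"], ["v w x y z"])
```

Because the score is a geometric mean of n-gram precisions, a system that rarely produces correct 3- and 4-grams scores very low even when many individual words are right, which is one reason low-resource baselines can look so weak on this metric.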

Experimental Insights

A key finding is the effectiveness of semi-supervised approaches that incorporate back-translation. Back-translation improves BLEU scores, especially when combined with multilingual training on Hindi-English parallel corpora. The results underscore the value of adding data from a linguistically related language to improve low-resource MT performance.
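The general idea of back-translation can be sketched as follows. This is a generic illustration, not the paper's training recipe: `reverse_translate` stands in for any target-to-source model, the function names are hypothetical, and details such as the number of iterations or how real and synthetic data are weighted are not covered by this summary.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(target_monolingual: Iterable[str],
                   reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Pair each genuine target sentence with a machine-generated source sentence."""
    return [(reverse_translate(sentence), sentence) for sentence in target_monolingual]


def build_training_set(parallel: Iterable[Pair],
                       target_monolingual: Iterable[str],
                       reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Concatenate real parallel pairs with synthetic back-translated pairs."""
    return list(parallel) + back_translate(target_monolingual, reverse_translate)


def toy_reverse_model(sentence: str) -> str:
    """Placeholder for a trained target-to-source model (assumption)."""
    return "<synthetic> " + sentence


real_pairs = [("source one", "target one")]
monolingual_targets = ["target two", "target three"]
training_set = build_training_set(real_pairs, monolingual_targets, toy_reverse_model)
```

The target side of every synthetic pair is human-written text, so the forward model still learns to produce fluent output even though the source side is noisy.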

The paper also highlights the impact of domain drift: existing parallel datasets appear closer to English Wikipedia content. This domain mismatch contributes to the translation difficulty and underscores the importance of domain-aligned training data.

Implications and Future Directions

The FLoRes datasets provide a publicly available benchmark that fills a gap in MT research and encourages further work on low-resource language pairs. The paper invites the research community to use these datasets to develop new MT systems. The results also point to future research directions, such as better domain adaptation techniques and deeper multilingual approaches to narrow the performance gap.

In conclusion, the FLoRes evaluation datasets are a significant contribution to evaluating and advancing low-resource machine translation methods, with an emphasis on practical applicability.