
Abstract

We conducted a detailed analysis of the quality of web-mined corpora for two low-resource languages (covering three language pairs: English-Sinhala, English-Tamil, and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out intrinsic and extrinsic evaluations on different portions of the ranked corpora. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained on their highest-ranked 25k portion can be on par with models trained on human-curated datasets.

Overview

  • The study investigates the impact of corpus quality on Neural Machine Translation (NMT) performance, focusing on web-mined parallel corpora for Sinhala-English, Tamil-English, and Sinhala-Tamil language pairs.

  • It introduces a corpus quality assessment method based on sentence-similarity rankings, enabling a finer-grained analysis of web-mined corpora than the small random samples used in traditional assessments.

  • Key findings show that NMT models trained on a high-quality segment of a web-mined corpus significantly outperform models trained on the entire, unfiltered corpus.

  • The research challenges the assumption of uniform noise in web-mined data, showing that careful curation and filtering can lead to NMT outcomes comparable to those of human-curated corpora.

Evaluating the Impact of Corpus Quality on NMT Performance

Introduction to Corpus Quality in NMT

The performance of Neural Machine Translation (NMT) models is strongly influenced by the quality and quantity of the available training data. While large language models (LLMs) have made strides in machine translation, especially for high-resource languages, low-resource languages continue to struggle due to a paucity of parallel corpora. Publicly available, web-mined parallel corpora offer a potential solution by providing large amounts of bitext for hundreds of languages. However, the inherent noisiness of such datasets, particularly for low-resource languages, has been a cause for concern. Contrary to the previous assumption that noise in web-mined corpora is uniformly distributed, this paper presents evidence that a filtered selection of high-quality sentence pairs from these corpora can yield NMT performance on par with models trained on human-curated datasets.

Unpacking the Quality Variance in Web-mined Corpora

The study focuses on three language pairs: Sinhala-English, Tamil-English, and Sinhala-Tamil. By ranking every sentence pair in a corpus by cross-lingual similarity (sketched below), the research diverges from traditional quality assessments that rely on small, random samples. This approach enables a detailed view of the quality spectrum within these corpora, distinguishing high-quality segments from low-quality ones. The core of the methodology combines an intrinsic evaluation, consisting of human judgments of sentence-pair quality, with an extrinsic evaluation in which NMT systems are trained on different portions of the ranked corpora.
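
To make the ranking step concrete, here is a minimal sketch of similarity-based corpus ranking. The paper's exact scoring model is not specified in this summary, so the sketch assumes an off-the-shelf multilingual encoder (LaBSE, loaded via the sentence-transformers library); the function and variable names are illustrative, not the authors' code.

    # Rank a web-mined parallel corpus by cross-lingual sentence similarity.
    # Illustrative sketch: assumes the LaBSE encoder from sentence-transformers;
    # the paper's actual similarity measure may differ.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("sentence-transformers/LaBSE")

    def rank_corpus(src_sentences, tgt_sentences):
        """Return (score, src, tgt) triples sorted best-first by cosine similarity."""
        src_emb = model.encode(src_sentences, normalize_embeddings=True)
        tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)
        # With L2-normalized embeddings, the row-wise dot product is the cosine.
        scores = np.sum(src_emb * tgt_emb, axis=1)
        return sorted(zip(scores, src_sentences, tgt_sentences),
                      key=lambda t: float(t[0]), reverse=True)

Because the list is sorted best-first, the "top 25k portion" discussed below is simply the head of this ranking.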

Key Findings and Implications

  • Performance Dichotomy: Training NMT models on just the top 25,000 ranked sentence pairs from a web-mined corpus significantly outperformed training on the entirety of the same corpus (see the filtering sketch after this list). This is a stark performance gap driven by corpus quality, even within a single web-mined dataset.
  • Optimal Corpus Segment: On average, training on the highest-quality segment of a web-mined corpus was enough to reach the best NMT performance observed. Remarkably, these results are comparable to those of models trained on human-curated corpora.
  • Noise Identification: An in-depth human evaluation aimed at categorizing the types of noise present in the top-quality segments of the corpora sheds light on the nuanced challenges of utilizing web-mined data for NMT.
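
Below is a hedged sketch of the filtering step these findings imply: take the head of the ranked list and write parallel text files for NMT training. It assumes the `ranked` output from the earlier snippet; the 25k cut-off mirrors the finding above, and the file names are illustrative.

    # Keep only the highest-ranked 25k pairs for NMT training.
    # Assumes `ranked` from the ranking sketch above; file names are illustrative.
    TOP_K = 25_000
    top_pairs = ranked[:TOP_K]  # ranked is sorted best-first

    # Write parallel plain-text files in the format most NMT toolkits expect.
    with open("train.en", "w", encoding="utf-8") as f_src, \
         open("train.si", "w", encoding="utf-8") as f_tgt:
        for _score, src, tgt in top_pairs:
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")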

Theoretical and Practical Contributions

This research clarifies the nuanced role of data quality in NMT performance, especially regarding low-resource languages. The findings challenge the prevailing notion of uniform noise distribution in web-mined corpora, suggesting that careful curation and quality filtering can substantially improve NMT outcomes. Practically, the study provides a blueprint for leveraging the vast, yet noisy, resources of web-mined corpora more effectively.

Directions for Future Research

The implications of these findings are profound for the continued development of NMT, particularly for languages suffering from a scarcity of high-quality parallel corpora. Future research could expand this evaluative framework to other languages and corpora, refining the methodologies for identifying and extracting high-quality segments of web-mined data. Additionally, further exploration into automated quality assessment and filtering techniques could enhance the efficiency and scalability of preparing web-mined corpora for NMT training.

Concluding Remarks

The study underscores the importance of discerning quality differences within web-mined corpora, challenging researchers to rethink their strategies for corpus selection and use in NMT. By demonstrating that a carefully filtered subset of a web-mined corpus can rival the performance of a human-curated dataset, the paper advocates a more nuanced approach to leveraging the abundant, albeit noisy, data available for translation tasks. The work contributes valuable insights to NMT research and paves the way for more effective use of web-mined data in addressing the challenges faced by low-resource languages.
