
Abstract

Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation in which humans assess the quality of samples taken from the different corpora; then, we assess the practical impact of these qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in the quality of the corpora, with MaCoCu and OSCAR obtaining the best results. However, in the extrinsic evaluation, we actually find that the CC100 corpus achieves the highest scores. We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.

Figure: Average percentage of documents in each corpus, across seven languages, that do not consist of fully running text.

Overview

  • The paper conducts an extensive evaluation of four web-crawled corpora (CC100, MaCoCu, mC4, and OSCAR) for training language models across eleven lower-resourced European languages, focusing on data quality.

  • A manual evaluation phase, involving professional linguists, highlighted significant quality differences among the corpora, with MaCoCu and OSCAR showing higher quality in terms of coherence and accuracy.

  • An automatic evaluation phase demonstrated that, despite the quality disparities, models trained on CC100 achieved the best downstream performance, suggesting a complex relationship between data quality and language model efficacy.

  • The findings challenge the assumption that higher quality data always leads to better language model performance, promoting a reevaluation of criteria for curating training corpora and exploring how language models can adapt to varying data quality.

Evaluating the Impact of Text Quality in Web-Crawled Corpora for Language Model Training

Introduction to Corpus Quality and Language Model Performance

The deployment of web-crawled corpora in training language models (LMs) represents a cornerstone of recent advances in NLP. These corpora, often vast and unstructured, raise critical questions about the impact of data quality on LM performance. In response, the paper presents an extensive evaluation of four widely used web-crawled corpora, namely CC100, MaCoCu, mC4, and OSCAR, focusing on their qualitative properties and on the downstream effects of training language models on each of them across eleven lower-resourced European languages.

Manual Evaluation: A Dive into Data Quality

The researchers examined the corpora through a dual-phase evaluation. In the manual phase, professional linguists scrutinized data quality using a multi-tiered annotation scheme, and this revealed significant qualitative differences among the corpora. MaCoCu and OSCAR emerged as front-runners, containing a higher proportion of publishable, coherent running text, while mC4 showed notable deficiencies, especially for Maltese, where a large fraction of the data was inaccurately labeled or lacked coherence.
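
To make this setup concrete, the sketch below shows one way such a manual evaluation could be organized: randomly sampling documents from a corpus dump for annotation and then aggregating the annotators' labels into per-corpus proportions. The label set, file formats, and function names are illustrative assumptions, not the paper's actual annotation scheme.

```python
# Illustrative sketch only: the label set below is assumed, not the paper's scheme.
import csv
import random
from collections import Counter

# Hypothetical labels a linguist might assign to each sampled document.
LABELS = {"publishable running text", "non-running text", "wrong language", "incoherent"}

def sample_documents(corpus_path, n=100, seed=0):
    """Reservoir-sample n documents (one per line) from a corpus dump."""
    rng = random.Random(seed)
    sample = []
    with open(corpus_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < n:
                sample.append(line.strip())
            else:
                j = rng.randint(0, i)
                if j < n:
                    sample[j] = line.strip()
    return sample

def label_distribution(annotation_tsv):
    """Aggregate annotator decisions (doc_id <TAB> label) into label proportions."""
    counts = Counter()
    with open(annotation_tsv, encoding="utf-8") as f:
        for doc_id, label in csv.reader(f, delimiter="\t"):
            if label in LABELS:  # skip malformed rows or unknown labels
                counts[label] += 1
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

In the paper itself, trained linguists carried out the per-language annotation; a script like this would only handle the sampling and bookkeeping around it.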

Automatic Evaluation: Exploring Language Model Performance

The study then moved to an automatic evaluation, training language models on segments of these corpora for a subset of five languages. Surprisingly, despite the quality disparities highlighted during the manual evaluation, models trained on CC100 achieved the best performance on downstream tasks, suggesting a complex relationship between raw data quality and LM efficacy. This phase showed that data quality, as judged by human evaluators, does not straightforwardly translate into LM training outcomes, underscoring the resilience of LMs to variation in data quality.
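
As a rough illustration of this extrinsic setup, the snippet below pretrains a deliberately small masked LM on a raw-text sample from one corpus using Hugging Face transformers; repeating this per corpus and fine-tuning the resulting checkpoints on downstream tasks would yield the comparison described above. The input file name, the tiny architecture, and the reuse of the XLM-R tokenizer are assumptions made for the sketch, not the authors' exact configuration.

```python
# Minimal sketch, not the authors' exact setup: pretrain a small masked LM on a
# raw-text sample from a single corpus; the checkpoint can then be fine-tuned on
# downstream tasks and compared against checkpoints trained on other corpora.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Hypothetical input: one document per line, sampled from e.g. the mC4 Maltese split.
raw = load_dataset("text", data_files={"train": "mc4_mt_sample.txt"})

# Reuse an existing multilingual vocabulary instead of training a new tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Deliberately small architecture so that corpus quality, not model capacity,
# dominates the comparison.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=514,  # RoBERTa-style models offset position ids by 2
)
model = RobertaForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lm-mc4-mt",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        logging_steps=100,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

Holding the architecture, tokenizer, and training procedure fixed across corpora is what allows any downstream differences to be attributed to the data itself.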

Implications and Future Directions

The findings prompt a reevaluation of the criteria for curating training corpora for LMs, especially given that neither sheer data volume nor measured text quality straightforwardly predicts downstream performance. The research shifts the discourse towards understanding the nuanced dynamics between data quality and LM performance, challenging the prevailing notion that higher-quality datasets invariably lead to superior model performance.

This work also sets the stage for future explorations into the mechanisms through which LMs adapt to, or even leverage, variation in data quality. It opens pathways for developing more robust models that can efficiently handle the intricacies of web-crawled corpora, particularly for lower-resourced languages, which often suffer from data paucity and quality issues.

In conclusion, this study contributes a critical perspective to the ongoing dialogue on optimizing data sources for LM training. It provides evidence-based insights that question established assumptions and pave the way for refining data curation practices and model training methodologies in natural language processing.
