Abstract

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and from previously unused web crawls from the Internet Archive. We describe our methods for the acquisition, management, and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens deduplicated at the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are among the largest open text corpora ever released and provide a valuable resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Figure: Overview of the HPLT acquisition and processing pipeline.

Overview

  • Introduces a new dataset for language modeling and machine translation, featuring one of the largest multilingual text corpora.

  • Includes MonoHPLT with over 5.6 trillion word tokens in 75 languages, BiHPLT covering 18 English-centric language pairs, and MultiHPLT with 171 language pairs.

  • Releases 22 Machine Translation models and 9 Bicleaner AI models, alongside open-source tools for dataset processing.

  • Enables significant advancements in language technology, especially for low-resourced languages, and provides a blueprint for future large-scale language resource compilations.

Introduction to the HPLT Language Resources

The High Performance Language Technologies (HPLT) project introduces a new dataset for language modeling and machine translation (MT) training, encompassing one of the largest publicly available multilingual text corpora. This dataset includes both monolingual and parallel corpora extracted from the web, leveraging web crawls produced by the Internet Archive and CommonCrawl. The project also releases a suite of open-source tools and models aligned with the dataset to facilitate processing and application of the resources.

Dataset Composition

The HPLT language resources encompass:

  • MonoHPLT: A monolingual dataset covering 75 languages, with over 5.6 trillion word tokens. This part of the dataset emphasizes low- to medium-resourced languages.
  • BiHPLT: A parallel dataset focusing on English-centric language pairs, covering 18 language pairs and more than 96 million aligned sentence pairs.
  • MultiHPLT: Synthetic datasets created by pivoting parallel datasets through English, covering 171 language pairs with 157 million sentence pairs.
  • Released models: Alongside the data, the project releases 22 machine translation models used for bilingual document alignment and 9 Bicleaner AI models for sentence-pair scoring.
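The pivoting idea behind MultiHPLT can be illustrated with a small sketch (function and variable names here are hypothetical; the actual HPLT tooling is more involved): two English-centric corpora, en-X and en-Y, are joined on their shared English side to yield a synthetic X-Y pair.

```python
# Sketch of pivoting two English-centric parallel corpora through English.
# Hypothetical helper, not the HPLT project's actual implementation.
def pivot_through_english(en_x_pairs, en_y_pairs):
    """en_x_pairs / en_y_pairs: iterables of (english, other) sentence tuples.

    Returns synthetic (x, y) pairs whose English sides matched exactly.
    """
    # Index the en-Y corpus by its English side.
    en_to_y = {}
    for en, y in en_y_pairs:
        en_to_y.setdefault(en, []).append(y)

    # Join the en-X corpus against that index.
    pivoted = []
    for en, x in en_x_pairs:
        for y in en_to_y.get(en, []):
            pivoted.append((x, y))
    return pivoted
```

Exact-match joining like this is only a toy; real pivoting pipelines typically deduplicate and filter the resulting pairs, since one English sentence matching many translations inflates noise.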

Highlights and Contributions

The HPLT language resources present several notable contributions:

  1. Extensive Language Coverage: The dataset significantly contributes to the diversity of languages available for language technology development, particularly enhancing resources for low-resourced languages.
  2. Massive Scale: With trillions of word tokens across the monolingual datasets and hundreds of millions of aligned sentence pairs in the parallel corpus, the data volume places the release among the largest publicly available resources.
  3. Open Tools: Accompanying the dataset, the project releases a range of tools for managing, downloading, and processing large web-crawled corpora, enabling researchers to extend or replicate the dataset compilation process.
  4. Innovative Use of Web Crawls: The dataset incorporates previously unused web crawls from the Internet Archive, providing new text resources that were not available in other web-derived corpora.
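To make the document-level de-duplication mentioned above concrete, here is a minimal exact-deduplication sketch using content hashing. This is an illustration only, assuming whitespace-normalized exact matching; the project's actual pipeline and tools are more sophisticated.

```python
import hashlib

def dedup_documents(docs):
    """Yield each document whose whitespace-normalized text is new.

    Later exact duplicates are dropped; memory holds one digest per
    unique document rather than the full text.
    """
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc
```

At web scale, exact hashing like this is usually complemented by near-duplicate detection (e.g. MinHash), since crawled pages often differ only in boilerplate.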

Practical and Theoretical Implications

The availability of the HPLT language resources under a permissive CC0 license opens several avenues for research and development:

  • Training and Evaluation of LLMs: The sheer scale and diversity of the monolingual datasets offer a robust foundation for training LLMs, particularly in incorporating and evaluating low-resourced languages.
  • Advancements in MT: The parallel corpus, especially when considered alongside the synthetic datasets, presents significant resources for training and improving machine translation models across a wide range of language pairs.
  • Research in Data Compilation Techniques: The methodology applied in assembling the datasets, from web crawls to dataset processing and tooling, provides a valuable blueprint for future efforts in compiling large-scale language resources.
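For consuming a corpus of this size in training pipelines, documents are typically streamed shard by shard rather than loaded into memory. A minimal reader sketch follows, assuming shards are JSON Lines files with one document object per line and a `text` field; both the file layout and the field name are assumptions for illustration, not confirmed details of the HPLT release format.

```python
import json

def iter_documents(path):
    """Stream document texts from a JSON Lines shard, one object per line.

    Assumes each line is a JSON object with a "text" field (hypothetical).
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)["text"]
```

Because the function is a generator, downstream tokenization and batching can consume shards lazily without materializing a multi-terabyte corpus.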

Future Directions

While the current release of the HPLT language resources marks a significant milestone, future developments are anticipated to expand language coverage further, enhance the dataset with more granular metadata, and extend tools for even more efficient processing. Additionally, the project aims to contribute models and training pipelines, enriching the ecosystem around the dataset.

Concluding Remarks

The HPLT language resources demonstrate the potential of leveraging web-derived data to create extensive, diverse, and accessible datasets for language technology research and development. By making these resources publicly available, the project not only facilitates immediate advancements in language modeling and machine translation but also sets the stage for future innovations in the field.
