Abstract

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and from previously unused web crawls from the Internet Archive. We describe our methods for the acquisition, management, and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens deduplicated at the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are among the largest open text corpora ever released and provide a valuable resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Figure: Overview of the HPLT acquisition and processing pipeline.

Overview

  • Introduces a new dataset for language modeling and machine translation, featuring one of the largest multilingual text corpora.

  • Includes MonoHPLT with over 5.6 trillion word tokens in 75 languages, BiHPLT covering 18 English-centric language pairs, and MultiHPLT with 171 language pairs.

  • Releases 22 Machine Translation models and 9 Bicleaner AI models, alongside open-source tools for dataset processing.

  • Enables significant advancements in language technology, especially for low-resourced languages, and provides a blueprint for future large-scale language resource compilations.

Introduction to the HPLT Language Resources

The High Performance Language Technologies (HPLT) project introduces a new dataset for language modeling and machine translation (MT) training, encompassing one of the largest publicly available multilingual text corpora. This dataset includes both monolingual and parallel corpora extracted from the web, leveraging web crawls produced by the Internet Archive and CommonCrawl. The project also releases a suite of open-source tools and models aligned with the dataset to facilitate processing and application of the resources.

Dataset Composition

The HPLT language resources encompass:

  • MonoHPLT: A monolingual dataset covering 75 languages, with over 5.6 trillion word tokens. This part of the dataset emphasizes low- to medium-resourced languages.
  • BiHPLT: A parallel dataset focusing on English-centric language pairs, covering 18 language pairs and more than 96 million aligned sentence pairs.
  • MultiHPLT: Synthetic datasets created by pivoting parallel datasets through English, covering 171 language pairs with 157 million sentence pairs.
  • Released models: Alongside the data, the project releases 22 machine translation models used for bilingual document alignment and 9 Bicleaner AI models for sentence-pair scoring.
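The pivoting idea behind MultiHPLT can be illustrated with a small sketch (function and variable names here are hypothetical; the actual HPLT tooling is more involved): two English-centric corpora, en-X and en-Y, are joined on their shared English side to yield a synthetic X-Y pair.

```python
# Sketch of pivoting two English-centric parallel corpora through English.
# Hypothetical helper, not the HPLT project's actual implementation.
def pivot_through_english(en_x_pairs, en_y_pairs):
    """en_x_pairs / en_y_pairs: iterables of (english, other) sentence tuples.

    Returns synthetic (x, y) pairs whose English sides matched exactly.
    """
    # Index the en-Y corpus by its English side.
    en_to_y = {}
    for en, y in en_y_pairs:
        en_to_y.setdefault(en, []).append(y)

    # Join the en-X corpus against that index.
    pivoted = []
    for en, x in en_x_pairs:
        for y in en_to_y.get(en, []):
            pivoted.append((x, y))
    return pivoted
```

Exact-match joining like this is only a toy; real pivoting pipelines typically deduplicate and filter the resulting pairs, since one English sentence matching many translations inflates noise.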

Highlights and Contributions

The HPLT language resources present several notable contributions:

  1. Extensive Language Coverage: The dataset significantly contributes to the diversity of languages available for language technology development, particularly enhancing resources for low-resourced languages.
  2. Massive Scale: With trillions of word tokens across the monolingual datasets and hundreds of millions of aligned sentence pairs in the parallel corpus, the data volume places the release among the largest publicly available resources.
  3. Open Tools: Accompanying the dataset, the project releases a range of tools for managing, downloading, and processing large web-crawled corpora, enabling researchers to extend or replicate the dataset compilation process.
  4. Innovative Use of Web Crawls: The dataset incorporates previously unused web crawls from the Internet Archive, providing new text resources that were not available in other web-derived corpora.
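To make the document-level de-duplication mentioned above concrete, here is a minimal exact-deduplication sketch using content hashing. This is an illustration only, assuming whitespace-normalized exact matching; the project's actual pipeline and tools are more sophisticated.

```python
import hashlib

def dedup_documents(docs):
    """Yield each document whose whitespace-normalized text is new.

    Later exact duplicates are dropped; memory holds one digest per
    unique document rather than the full text.
    """
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc
```

At web scale, exact hashing like this is usually complemented by near-duplicate detection (e.g. MinHash), since crawled pages often differ only in boilerplate.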

Practical and Theoretical Implications

The availability of the HPLT language resources under a permissive CC0 license opens several avenues for research and development:

  • Training and Evaluation of LLMs: The sheer scale and diversity of the monolingual datasets offer a robust foundation for training LLMs, particularly in incorporating and evaluating low-resourced languages.
  • Advancements in MT: The parallel corpus, especially when considered alongside the synthetic datasets, presents significant resources for training and improving machine translation models across a wide range of language pairs.
  • Research in Data Compilation Techniques: The methodology applied in assembling the datasets, from web crawls to dataset processing and tooling, provides a valuable blueprint for future efforts in compiling large-scale language resources.
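For consuming a corpus of this size in training pipelines, documents are typically streamed shard by shard rather than loaded into memory. A minimal reader sketch follows, assuming shards are JSON Lines files with one document object per line and a `text` field; both the file layout and the field name are assumptions for illustration, not confirmed details of the HPLT release format.

```python
import json

def iter_documents(path):
    """Stream document texts from a JSON Lines shard, one object per line.

    Assumes each line is a JSON object with a "text" field (hypothetical).
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)["text"]
```

Because the function is a generator, downstream tokenization and batching can consume shards lazily without materializing a multi-terabyte corpus.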

Future Directions

While the current release of the HPLT language resources marks a significant milestone, future developments are anticipated to expand language coverage further, enhance the dataset with more granular metadata, and extend tools for even more efficient processing. Additionally, the project aims to contribute models and training pipelines, enriching the ecosystem around the dataset.

Concluding Remarks

The HPLT language resources demonstrate the potential of leveraging web-derived data to create extensive, diverse, and accessible datasets for language technology research and development. By making these resources publicly available, the project not only facilitates immediate advancements in language modeling and machine translation but also sets the stage for future innovations in the field.
