
Abstract

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data, and even open models rarely release the datasets they are trained on or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art open language model and framework designed to build and study the science of language modeling.

Figure: model ablations showing how varying the data mix affects performance on web, code, and S2ORC-based benchmarks.

Overview

  • Dolma is a three-trillion-token English corpus for LM pretraining, built from diverse data sources.

  • It is designed for transparent and reproducible research, with care to minimize harmful content.

  • An accompanying toolkit supports quality filtering and safeguards against sensitive data entering the LM training process.

  • Experiments with Dolma show how curation decisions such as filtering, deduplication, and source mixing affect language model performance.

  • The corpus supports the ethos of open and responsible AI research in language modeling.

Introduction

Language models (LMs) are critical for a wide array of natural language processing tasks, from question answering to summarization. However, the specifics of LM development, particularly the composition of pretraining data, are often obscured, whether due to proprietary secrecy or a lack of comprehensive documentation. Documenting and releasing an extensive pretraining dataset addresses this gap and propels open research. Dolma fills this role: a publicly available, three-trillion-token English corpus built from a variety of sources, including web content, scientific literature, software code, public-domain books, social media, and encyclopedic materials.

Dolma Design Goals

The dataset was built to specific design requirements intended to enhance transparency and reproducibility in LM research: consistency with existing language model pretraining practices, scale sufficient for training large models, open release of the data, and extensive efforts to minimize the risk of harm from potentially sensitive content. Dolma matches the scale and diversity of known pretraining corpora while adding careful curation to limit potentially harmful data such as personally identifiable information and derogatory content.

Data Curation and Toolkit

A high-performance toolkit was developed to efficiently process large volumes of text for language model pretraining. It serves multiple purposes: language filtering to retain English-only content, quality filtering to eliminate low-quality text, and deduplication at multiple granularities. Additional filtering masks or removes personal data and mitigates the spread of undesired content, including toxic text, making the corpus safer to train on and analyze.
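
To make these steps concrete, here is a minimal sketch of a single curation pass combining a language filter, a crude quality heuristic, and exact document-level deduplication. It is not the Dolma toolkit's API: the thresholds and helpers (english_score, symbol_ratio) are hypothetical stand-ins for the trained classifiers and rules the real pipeline uses, which also runs these steps in parallel at much larger scale.

```python
import hashlib
import json

# Hypothetical thresholds for illustration only.
MIN_ENGLISH_SCORE = 0.5   # assumed language-ID confidence cutoff
MAX_SYMBOL_RATIO = 0.1    # assumed crude quality heuristic

def english_score(text: str) -> float:
    """Stand-in for a language-ID model; the real pipeline would use a trained classifier."""
    ascii_letters = sum(c.isascii() and c.isalpha() for c in text)
    return ascii_letters / max(len(text), 1)

def symbol_ratio(text: str) -> float:
    """Fraction of non-alphanumeric, non-whitespace characters."""
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    return symbols / max(len(text), 1)

def curate(documents):
    """Yield documents that pass language and quality filters, dropping exact duplicates."""
    seen_hashes = set()
    for doc in documents:
        text = doc["text"]
        if english_score(text) < MIN_ENGLISH_SCORE:
            continue  # language filter: keep (mostly) English text
        if symbol_ratio(text) > MAX_SYMBOL_RATIO:
            continue  # quality filter: drop symbol-heavy boilerplate
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact document-level deduplication
        seen_hashes.add(digest)
        yield doc

if __name__ == "__main__":
    docs = [
        {"id": "1", "text": "Language models are trained on large text corpora."},
        {"id": "2", "text": "Language models are trained on large text corpora."},  # duplicate
        {"id": "3", "text": "@@@ ### $$$ %%%"},  # low-quality, symbol-heavy
    ]
    for kept in curate(docs):
        print(json.dumps(kept))
```

In practice each filter is a separate, configurable stage so that decisions can be audited and re-run independently, which is part of what makes the released toolkit useful for reproducing the corpus.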

Experiments and Findings

The paper includes a range of experiments that measure model performance, both domain fit and downstream task accuracy, at various intermediate states of Dolma. Such ablation studies are essential for understanding how different data subsets shape LM capabilities. For instance, including source code was found to benefit reasoning-related tasks, underscoring the importance of multi-source mixing in dataset construction. Other experiments examined how to tune content and quality filters to balance data quality against breadth of coverage.
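
As an illustration of how such ablations are commonly scored, the sketch below compares two checkpoints by perplexity on a held-out domain sample (lower is better fit). This is a generic example, not the paper's evaluation code, and the checkpoint names and held-out texts are placeholders.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def domain_perplexity(model_name: str, texts: list[str]) -> float:
    """Average perplexity of a causal LM over held-out documents from one domain."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    total_loss, n = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
            loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
            total_loss += loss.item()
            n += 1
    return math.exp(total_loss / max(n, 1))

# Hypothetical checkpoints trained on two ablated data mixes, evaluated on a
# held-out code sample; a lower perplexity indicates better domain fit.
held_out_code = ["def add(a, b):\n    return a + b\n"]
for ckpt in ["ablation-mix-with-code", "ablation-mix-without-code"]:
    print(ckpt, domain_perplexity(ckpt, held_out_code))
```

Downstream-task evaluation complements this kind of domain-fit measurement, since a mix that lowers perplexity on one domain does not necessarily improve end-task accuracy.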

Dolma has also been used to train OLMo, a state-of-the-art open language model, demonstrating the corpus in practice. Because both the dataset and the accompanying curation toolkit are open source, Dolma is ready to support a wide range of follow-on research.

Conclusion

Dolma reflects a commitment to transparency and scrutiny in language model training. It sets a new benchmark for dataset scale, diversity, and curation quality, paving the way for more inclusive and less biased research in language modeling. The corpus invites the broader community into a collaborative effort to advance language model research, grounded in principles of openness and responsible AI.
