
Abstract

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data, and even open models rarely release the datasets they are trained on or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art open language model and framework designed to build and study the science of language modeling.

Figure: model ablations showing how varying the data mix affects performance on web, code, and S2ORC-based benchmarks.

Overview

  • Dolma is a three-trillion-token English corpus for LM pretraining, built from diverse data sources.

  • It is designed for transparent and reproducible research, with care to minimize harmful content.

  • An accompanying toolkit supports quality filtering and safeguards against sensitive data entering the LM training process.

  • Experiments with Dolma show how curation decisions such as filtering, deduplication, and source mixing affect language model performance.

  • The corpus supports the ethos of open and responsible AI research in language modeling.

Introduction

Language models (LMs) are critical for a wide array of natural language processing tasks, from question answering to summarization. However, the specifics of LM development, particularly the composition of pretraining data, are often obscured, whether due to proprietary secrecy or a lack of comprehensive documentation. Documenting and releasing an extensive pretraining dataset addresses this gap and propels open research. Dolma fills this role: a publicly available, three-trillion-token English corpus built from a variety of sources, including web content, scientific literature, software code, public-domain books, social media, and encyclopedic materials.

Dolma Design Goals

The dataset was built to specific design requirements intended to enhance transparency and reproducibility in LM research: consistency with existing language model pretraining practices, scale sufficient for training large models, open release of the data, and extensive efforts to minimize the risk of harm from potentially sensitive content. Dolma matches the scale and diversity of known pretraining corpora while adding careful curation to limit potentially harmful data such as personally identifiable information and derogatory content.

Data Curation and Toolkit

A high-performance toolkit was developed to efficiently process large volumes of text for language model pretraining. It serves multiple purposes: language filtering to retain English-only content, quality filtering to eliminate low-quality text, and deduplication at multiple granularities. Additional filtering masks or removes personal data and mitigates the spread of undesired content, including toxic text, making the corpus safer to train on and analyze.
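
To make these steps concrete, here is a minimal sketch of a single curation pass combining a language filter, a crude quality heuristic, and exact document-level deduplication. It is not the Dolma toolkit's API: the thresholds and helpers (english_score, symbol_ratio) are hypothetical stand-ins for the trained classifiers and rules the real pipeline uses, which also runs these steps in parallel at much larger scale.

```python
import hashlib
import json

# Hypothetical thresholds for illustration only.
MIN_ENGLISH_SCORE = 0.5   # assumed language-ID confidence cutoff
MAX_SYMBOL_RATIO = 0.1    # assumed crude quality heuristic

def english_score(text: str) -> float:
    """Stand-in for a language-ID model; the real pipeline would use a trained classifier."""
    ascii_letters = sum(c.isascii() and c.isalpha() for c in text)
    return ascii_letters / max(len(text), 1)

def symbol_ratio(text: str) -> float:
    """Fraction of non-alphanumeric, non-whitespace characters."""
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    return symbols / max(len(text), 1)

def curate(documents):
    """Yield documents that pass language and quality filters, dropping exact duplicates."""
    seen_hashes = set()
    for doc in documents:
        text = doc["text"]
        if english_score(text) < MIN_ENGLISH_SCORE:
            continue  # language filter: keep (mostly) English text
        if symbol_ratio(text) > MAX_SYMBOL_RATIO:
            continue  # quality filter: drop symbol-heavy boilerplate
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact document-level deduplication
        seen_hashes.add(digest)
        yield doc

if __name__ == "__main__":
    docs = [
        {"id": "1", "text": "Language models are trained on large text corpora."},
        {"id": "2", "text": "Language models are trained on large text corpora."},  # duplicate
        {"id": "3", "text": "@@@ ### $$$ %%%"},  # low-quality, symbol-heavy
    ]
    for kept in curate(docs):
        print(json.dumps(kept))
```

In practice each filter is a separate, configurable stage so that decisions can be audited and re-run independently, which is part of what makes the released toolkit useful for reproducing the corpus.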

Experiments and Findings

The paper includes a range of experiments that measure model performance, both domain fit and downstream task accuracy, at various intermediate states of Dolma. Such ablation studies are essential for understanding how different data subsets shape LM capabilities. For instance, including source code was found to benefit reasoning-related tasks, underscoring the importance of multi-source mixing in dataset construction. Other experiments examined how to tune content and quality filters to balance data quality against breadth of coverage.
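
As an illustration of how such ablations are commonly scored, the sketch below compares two checkpoints by perplexity on a held-out domain sample (lower is better fit). This is a generic example, not the paper's evaluation code, and the checkpoint names and held-out texts are placeholders.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def domain_perplexity(model_name: str, texts: list[str]) -> float:
    """Average perplexity of a causal LM over held-out documents from one domain."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    total_loss, n = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
            loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
            total_loss += loss.item()
            n += 1
    return math.exp(total_loss / max(n, 1))

# Hypothetical checkpoints trained on two ablated data mixes, evaluated on a
# held-out code sample; a lower perplexity indicates better domain fit.
held_out_code = ["def add(a, b):\n    return a + b\n"]
for ckpt in ["ablation-mix-with-code", "ablation-mix-without-code"]:
    print(ckpt, domain_perplexity(ckpt, held_out_code))
```

Downstream-task evaluation complements this kind of domain-fit measurement, since a mix that lowers perplexity on one domain does not necessarily improve end-task accuracy.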

Dolma has also been used to train OLMo, a state-of-the-art open language model, demonstrating the corpus in practice. Because both the dataset and the accompanying curation toolkit are open source, Dolma is ready to support a wide range of follow-on research.

Conclusion

Dolma reflects a commitment to transparency and scrutiny in language model training. It sets a new benchmark for dataset scale, diversity, and curation quality, paving the way for more inclusive and less biased research in language modeling. The corpus invites the broader community into a collaborative effort to advance language model research, grounded in principles of openness and responsible AI.
