
Abstract

The performance of an LLM depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available, and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.

Figure: FineWeb datasets vs. other public datasets on various benchmarks.

Overview

  • The FineWeb datasets, FineWeb and FineWeb-Edu, provide high-quality pretraining data for LLMs: FineWeb is a 15-trillion token dataset sourced from 96 Common Crawl snapshots, and FineWeb-Edu is a 1.3-trillion token educational subset of it.

  • Multiple dataset curation techniques, including text extraction, deduplication, and heuristic filtering, were employed to enhance data quality and LLM performance, accompanied by an analysis of social biases in the resulting data.

  • The datasets, along with their curation processes and accompanying codebase, are openly shared, aiming to democratize access to high-quality LLM training data and foster innovation in AI research.

Overview of the FineWeb Datasets: A Novel Approach to High-Quality Pretraining Data for LLMs

Introduction

The study introduces the FineWeb datasets, FineWeb and FineWeb-Edu, motivated by the central role that large, high-quality pretraining datasets play in LLM performance. Because the pretraining data of state-of-the-art open LLMs is rarely disclosed, the authors aim to close this knowledge gap by openly releasing the datasets along with comprehensive documentation of their curation processes. FineWeb is a 15-trillion token dataset derived from 96 Common Crawl snapshots, while FineWeb-Edu is a 1.3-trillion token subset filtered to prioritize educational content. The authors refined their curation strategy empirically, running ablation experiments to measure how filtering and deduplication choices affect LLM performance.

Data Extraction and Processing

The research emphasizes the importance of effective text extraction from Common Crawl data. Text was extracted from the raw WARC files using the trafilatura library, which the authors found yields higher-quality text than the pre-extracted WET files. The base filtering pipeline then applied URL blocklist filtering, fastText-based language identification to retain English text, and quality and repetition filters adapted from MassiveText.
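
As a rough illustration of this step, the sketch below extracts main text from raw HTML with trafilatura and keeps only confidently English documents via fastText language identification; the lid.176.bin model path and the 0.65 score threshold are illustrative assumptions rather than the exact pipeline configuration.

    # Sketch: extract main text from a WARC record's HTML payload and keep English documents.
    # Assumes the off-the-shelf fastText language-ID model "lid.176.bin" is available locally.
    import fasttext
    import trafilatura

    LANG_MODEL = fasttext.load_model("lid.176.bin")
    ENGLISH_THRESHOLD = 0.65  # illustrative cutoff, not necessarily the exact published value

    def extract_english_text(html: str) -> str | None:
        """Return the extracted main text if it is confidently English, else None."""
        text = trafilatura.extract(html)  # main-content extraction from raw HTML
        if not text:
            return None
        labels, scores = LANG_MODEL.predict(text.replace("\n", " "))  # predict() expects one line
        if labels[0] == "__label__en" and scores[0] >= ENGLISH_THRESHOLD:
            return text
        return None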

Deduplication Strategies

The study explores multiple deduplication methodologies, ultimately adopting MinHash-based deduplication performed independently on each snapshot. Notably, applying global deduplication across all 96 snapshots did not improve performance: because older snapshots contain a large share of content duplicated in later crawls, global deduplication removed a disproportionate amount of their data, and the documents that survived tended to be of lower quality. Deduplicating each snapshot independently yielded better-performing models, aligning with the results reported for RefinedWeb.
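
A minimal sketch of per-snapshot MinHash deduplication is shown below, here using the datasketch library; the shingle size, number of permutations, and similarity threshold are illustrative choices rather than the exact FineWeb parameters.

    # Sketch: near-duplicate removal within a single crawl snapshot via MinHash + LSH.
    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128   # number of hash permutations (illustrative)
    SHINGLE = 5      # word 5-gram shingles

    def minhash_of(text: str) -> MinHash:
        words = text.lower().split()
        m = MinHash(num_perm=NUM_PERM)
        for i in range(max(len(words) - SHINGLE + 1, 1)):
            m.update(" ".join(words[i:i + SHINGLE]).encode("utf-8"))
        return m

    def dedup_snapshot(docs: dict[str, str]) -> list[str]:
        """Keep one document per near-duplicate cluster within one snapshot."""
        lsh = MinHashLSH(threshold=0.75, num_perm=NUM_PERM)
        kept = []
        for doc_id, text in docs.items():
            mh = minhash_of(text)
            if lsh.query(mh):        # a near-duplicate has already been kept
                continue
            lsh.insert(doc_id, mh)
            kept.append(doc_id)
        return kept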

Custom Heuristic Filters

To further improve dataset quality, the authors applied a subset of the filters used for the C4 dataset and then developed additional custom heuristic filters. The custom filters were derived empirically by comparing metric distributions between high- and low-quality partitions of an older Common Crawl snapshot. The selected filters, which target text coherence and repetition, delivered notable performance gains while removing only a small fraction of the data.
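
The sketch below shows the general shape of such document-level heuristics; the specific metrics and thresholds are illustrative examples of this style of filter, not the published FineWeb values.

    # Sketch: coherence/repetition heuristics of the kind tuned in FineWeb.
    # Thresholds are illustrative, not the exact published values.
    def passes_heuristics(text: str) -> bool:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            return False

        # Fraction of lines ending in terminal punctuation (very low values suggest boilerplate lists).
        punct_frac = sum(ln.endswith((".", "!", "?", '"')) for ln in lines) / len(lines)

        # Fraction of characters that belong to duplicated lines (menus, navigation, templates).
        seen, dup_chars = set(), 0
        for ln in lines:
            if ln in seen:
                dup_chars += len(ln)
            seen.add(ln)
        dup_frac = dup_chars / max(len(text), 1)

        # Fraction of very short lines (fragmented, low-coherence pages).
        short_frac = sum(len(ln) < 30 for ln in lines) / len(lines)

        return punct_frac > 0.12 and dup_frac < 0.1 and short_frac < 0.67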

FineWeb-Edu: Focus on Educational Content

FineWeb-Edu was built by using Llama-3-70B-Instruct to generate synthetic educational-quality annotations for a sample of FineWeb documents, which were then used to train a classifier that extracts educational content from the full dataset. The resulting 1.3-trillion token FineWeb-Edu dataset achieved significant performance gains on knowledge- and reasoning-intensive benchmarks like MMLU and ARC, outperforming other openly accessible web-based datasets.
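
A simplified sketch of this classifier-training step follows: given LLM-annotated educational scores for a sample of documents, fit a lightweight regression head on document embeddings and keep documents whose predicted score clears a threshold. The embedding model, the Ridge head, and the threshold of 3 are illustrative assumptions, not the exact published setup.

    # Sketch: train an educational-quality regressor on synthetic 0-5 scores from an instruct
    # model, then filter documents by predicted score. Model choice and threshold are illustrative.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import Ridge

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

    def train_edu_regressor(texts: list[str], llm_scores: list[float]) -> Ridge:
        """Fit a regression head on document embeddings against LLM-annotated scores."""
        return Ridge().fit(embedder.encode(texts), llm_scores)

    def keep_educational(head: Ridge, texts: list[str], threshold: float = 3.0) -> list[str]:
        """Retain documents whose predicted educational score clears the threshold."""
        preds = head.predict(embedder.encode(texts))
        return [t for t, s in zip(texts, preds) if s >= threshold]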

Bias Analysis

The authors conducted a bias analysis to uncover distributional skews related to gender, age, and religion terms within the FineWeb and FineWeb-Edu datasets. While measured biases were generally low, the analysis revealed some skewed associations that reflect societal biases present in the sourced web content. FineWeb-Edu exhibited fewer biased associations, with its strongest term associations reflecting its educational emphasis on topics such as history and health.
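
As a rough sketch of this style of analysis, the snippet below tallies which words co-occur in documents with gendered pronouns; the term lists and the document-level co-occurrence window are illustrative assumptions, not the paper's exact methodology.

    # Sketch: document-level co-occurrence counts for a simple bias audit.
    from collections import Counter

    GENDER_TERMS = {"he": "male", "him": "male", "she": "female", "her": "female"}

    def cooccurrence_counts(docs: list[str]) -> dict[str, Counter]:
        """Count words appearing in documents that also contain each gendered pronoun."""
        counts = {"male": Counter(), "female": Counter()}
        for doc in docs:
            words = doc.lower().split()
            groups = {GENDER_TERMS[w] for w in words if w in GENDER_TERMS}
            for group in groups:
                counts[group].update(w for w in words if w not in GENDER_TERMS)
        return counts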

Implications and Future Work

The introduction of the FineWeb and FineWeb-Edu datasets marks a substantial contribution to openly available resources, narrowing the gap between proprietary and publicly documented pretraining data. These datasets, along with the released curation codebase and trained models, pave the way for more transparent, efficient, and accessible LLM training.

The study has implications for both practical and theoretical advancements. Practically, the datasets provide a valuable resource for the research community to train high-performing LLMs without the prohibitive cost of dataset curation. Theoretically, the systematic approach to developing heuristic filters and the empirical validation of deduplication strategies contribute to a deeper understanding of optimal data curation practices.

Future work could explore incorporating additional data types such as books and specialized content, refining dataset curation at larger scales, and evaluating the datasets in more diverse application contexts. Additionally, further analyses could investigate how specific filtering techniques affect model biases and memorization.

Conclusion

The FineWeb datasets, encompassing FineWeb and FineWeb-Edu, set a new benchmark for public LLM pretraining datasets. By openly sharing not only the datasets but also the accompanying curation processes and codebase, the study significantly enhances the resources available to the research community, fostering further innovation and reducing the reliance on closed, proprietary datasets. The potential for these datasets to drive further developments in LLM curation and training is substantial, reflecting a pivotal step towards more democratized and transparent AI research.
