
Abstract

The performance of an LLM depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available, and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.

Figure: FineWeb datasets vs. other public datasets on various benchmarks.

Overview

  • The FineWeb datasets, FineWeb and FineWeb-Edu, provide high-quality pretraining data for LLMs: FineWeb is a 15-trillion token dataset sourced from 96 Common Crawl snapshots, and FineWeb-Edu is a 1.3-trillion token educational subset of it.

  • Multiple dataset curation techniques, including text extraction, deduplication, and heuristic filtering, were employed to enhance data quality and LLM performance, accompanied by an analysis of social biases in the resulting data.

  • The datasets, along with their curation processes and accompanying codebase, are openly shared, aiming to democratize access to high-quality LLM training data and foster innovation in AI research.

Overview of the FineWeb Datasets: A Novel Approach to High-Quality Pretraining Data for LLMs

Introduction

The study introduces the FineWeb datasets, FineWeb and FineWeb-Edu, motivated by the central role that large, high-quality pretraining datasets play in LLM performance. Because the pretraining data of state-of-the-art open LLMs is rarely disclosed, the authors aim to close this knowledge gap by openly releasing the datasets along with comprehensive documentation of their curation processes. FineWeb is a 15-trillion token dataset derived from 96 Common Crawl snapshots, while FineWeb-Edu is a 1.3-trillion token subset filtered to prioritize educational content. The authors refined their curation strategy empirically, running ablation experiments to measure how filtering and deduplication choices affect LLM performance.

Data Extraction and Processing

The research emphasizes the importance of effective text extraction from Common Crawl data. Text was extracted from the raw WARC files using the trafilatura library, which the authors found yields higher-quality text than the pre-extracted WET files. The base filtering pipeline then applied URL blocklist filtering, fastText-based language identification to retain English text, and quality and repetition filters adapted from MassiveText.
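
As a rough illustration of this step, the sketch below extracts main text from raw HTML with trafilatura and keeps only confidently English documents via fastText language identification; the lid.176.bin model path and the 0.65 score threshold are illustrative assumptions rather than the exact pipeline configuration.

    # Sketch: extract main text from a WARC record's HTML payload and keep English documents.
    # Assumes the off-the-shelf fastText language-ID model "lid.176.bin" is available locally.
    import fasttext
    import trafilatura

    LANG_MODEL = fasttext.load_model("lid.176.bin")
    ENGLISH_THRESHOLD = 0.65  # illustrative cutoff, not necessarily the exact published value

    def extract_english_text(html: str) -> str | None:
        """Return the extracted main text if it is confidently English, else None."""
        text = trafilatura.extract(html)  # main-content extraction from raw HTML
        if not text:
            return None
        labels, scores = LANG_MODEL.predict(text.replace("\n", " "))  # predict() expects one line
        if labels[0] == "__label__en" and scores[0] >= ENGLISH_THRESHOLD:
            return text
        return None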

Deduplication Strategies

The study explores multiple deduplication methodologies, ultimately adopting MinHash-based deduplication performed independently on each snapshot. Notably, applying global deduplication across all 96 snapshots did not improve performance: because older snapshots contain a large share of content duplicated in later crawls, global deduplication removed a disproportionate amount of their data, and the documents that survived tended to be of lower quality. Deduplicating each snapshot independently yielded better-performing models, aligning with the results reported for RefinedWeb.
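
A minimal sketch of per-snapshot MinHash deduplication is shown below, here using the datasketch library; the shingle size, number of permutations, and similarity threshold are illustrative choices rather than the exact FineWeb parameters.

    # Sketch: near-duplicate removal within a single crawl snapshot via MinHash + LSH.
    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128   # number of hash permutations (illustrative)
    SHINGLE = 5      # word 5-gram shingles

    def minhash_of(text: str) -> MinHash:
        words = text.lower().split()
        m = MinHash(num_perm=NUM_PERM)
        for i in range(max(len(words) - SHINGLE + 1, 1)):
            m.update(" ".join(words[i:i + SHINGLE]).encode("utf-8"))
        return m

    def dedup_snapshot(docs: dict[str, str]) -> list[str]:
        """Keep one document per near-duplicate cluster within one snapshot."""
        lsh = MinHashLSH(threshold=0.75, num_perm=NUM_PERM)
        kept = []
        for doc_id, text in docs.items():
            mh = minhash_of(text)
            if lsh.query(mh):        # a near-duplicate has already been kept
                continue
            lsh.insert(doc_id, mh)
            kept.append(doc_id)
        return kept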

Custom Heuristic Filters

To further improve dataset quality, the authors applied a subset of the filters used for the C4 dataset and then developed additional custom heuristic filters. The custom filters were derived empirically by comparing metric distributions between high- and low-quality partitions of an older Common Crawl snapshot. The selected filters, which target text coherence and repetition, delivered notable performance gains while removing only a small fraction of the data.
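
The sketch below shows the general shape of such document-level heuristics; the specific metrics and thresholds are illustrative examples of this style of filter, not the published FineWeb values.

    # Sketch: coherence/repetition heuristics of the kind tuned in FineWeb.
    # Thresholds are illustrative, not the exact published values.
    def passes_heuristics(text: str) -> bool:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            return False

        # Fraction of lines ending in terminal punctuation (very low values suggest boilerplate lists).
        punct_frac = sum(ln.endswith((".", "!", "?", '"')) for ln in lines) / len(lines)

        # Fraction of characters that belong to duplicated lines (menus, navigation, templates).
        seen, dup_chars = set(), 0
        for ln in lines:
            if ln in seen:
                dup_chars += len(ln)
            seen.add(ln)
        dup_frac = dup_chars / max(len(text), 1)

        # Fraction of very short lines (fragmented, low-coherence pages).
        short_frac = sum(len(ln) < 30 for ln in lines) / len(lines)

        return punct_frac > 0.12 and dup_frac < 0.1 and short_frac < 0.67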

FineWeb-Edu: Focus on Educational Content

FineWeb-Edu was built by using Llama-3-70B-Instruct to generate synthetic educational-quality annotations for a sample of FineWeb documents, which were then used to train a classifier that extracts educational content from the full dataset. The resulting 1.3-trillion token FineWeb-Edu dataset achieved significant performance gains on knowledge- and reasoning-intensive benchmarks like MMLU and ARC, outperforming other openly accessible web-based datasets.
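
A simplified sketch of this classifier-training step follows: given LLM-annotated educational scores for a sample of documents, fit a lightweight regression head on document embeddings and keep documents whose predicted score clears a threshold. The embedding model, the Ridge head, and the threshold of 3 are illustrative assumptions, not the exact published setup.

    # Sketch: train an educational-quality regressor on synthetic 0-5 scores from an instruct
    # model, then filter documents by predicted score. Model choice and threshold are illustrative.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import Ridge

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

    def train_edu_regressor(texts: list[str], llm_scores: list[float]) -> Ridge:
        """Fit a regression head on document embeddings against LLM-annotated scores."""
        return Ridge().fit(embedder.encode(texts), llm_scores)

    def keep_educational(head: Ridge, texts: list[str], threshold: float = 3.0) -> list[str]:
        """Retain documents whose predicted educational score clears the threshold."""
        preds = head.predict(embedder.encode(texts))
        return [t for t, s in zip(texts, preds) if s >= threshold]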

Bias Analysis

The authors conducted a bias analysis to uncover distributional skews related to gender, age, and religion terms within the FineWeb and FineWeb-Edu datasets. While measured biases were generally low, the analysis revealed some skewed associations that reflect societal biases present in the sourced web content. FineWeb-Edu exhibited fewer biased associations, with its strongest term associations reflecting its educational emphasis on topics such as history and health.
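
As a rough sketch of this style of analysis, the snippet below tallies which words co-occur in documents with gendered pronouns; the term lists and the document-level co-occurrence window are illustrative assumptions, not the paper's exact methodology.

    # Sketch: document-level co-occurrence counts for a simple bias audit.
    from collections import Counter

    GENDER_TERMS = {"he": "male", "him": "male", "she": "female", "her": "female"}

    def cooccurrence_counts(docs: list[str]) -> dict[str, Counter]:
        """Count words appearing in documents that also contain each gendered pronoun."""
        counts = {"male": Counter(), "female": Counter()}
        for doc in docs:
            words = doc.lower().split()
            groups = {GENDER_TERMS[w] for w in words if w in GENDER_TERMS}
            for group in groups:
                counts[group].update(w for w in words if w not in GENDER_TERMS)
        return counts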

Implications and Future Work

The introduction of the FineWeb and FineWeb-Edu datasets marks a substantial contribution to openly available resources, narrowing the gap between proprietary and publicly documented pretraining data. These datasets, along with the released curation codebase and trained models, pave the way for more transparent, efficient, and accessible LLM training.

The study has implications for both practical and theoretical advancements. Practically, the datasets provide a valuable resource for the research community to train high-performing LLMs without the prohibitive cost of dataset curation. Theoretically, the systematic approach to developing heuristic filters and the empirical validation of deduplication strategies contribute to a deeper understanding of optimal data curation practices.

Future work could explore incorporating additional data types such as books and specialized content, refining dataset curation at larger scales, and evaluating the datasets in more diverse application contexts. Additionally, further analyses could investigate how specific filtering techniques affect model biases and memorization.

Conclusion

The FineWeb datasets, encompassing FineWeb and FineWeb-Edu, set a new benchmark for public LLM pretraining datasets. By openly sharing not only the datasets but also the accompanying curation processes and codebase, the study significantly enhances the resources available to the research community, fostering further innovation and reducing the reliance on closed, proprietary datasets. The potential for these datasets to drive further developments in LLM curation and training is substantial, reflecting a pivotal step towards more democratized and transparent AI research.
