OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Published 10 Oct 2023 in cs.AI, cs.CL, and cs.LG | (2310.06786v1)

Abstract: There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of LLMs. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B parameter LLMs on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of LLMs.




Summary

The paper introduces OpenWebMath, an open dataset of 14.7 billion tokens of mathematical web text intended to improve mathematical reasoning in LLMs.
It applies a four-stage pipeline (prefiltering, extraction, filtering, and deduplication) to retain high-quality, relevant mathematical content.
Empirical evaluations show that models trained on OpenWebMath outperform models trained on much larger general-domain datasets on mathematical tasks.

A Critical Examination of OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

The paper "OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text" addresses the lack of freely accessible, large-scale datasets of mathematical content. The dataset contains 14.7 billion tokens and is intended to fill the gap left by proprietary corpora, such as the one used to train the Minerva model, that are unavailable to open-source research. By drawing on mathematical webpages from Common Crawl, the authors aim to improve mathematical reasoning in LLMs through training on real mathematical web text.

Methodology and Dataset Construction

The authors build OpenWebMath with a four-stage processing pipeline designed to extract mathematical content faithfully while removing redundant and low-quality data:

Prefiltering: A stack of prefilters that match common mathematical markup and keywords keeps recall of relevant documents high. Because it discards non-mathematical pages early, this stage substantially reduces the computational cost of the later stages.
Text Extraction: The authors use a customized extraction step built on Resiliparse, chosen for its balance between efficiency and boilerplate removal, to parse math-heavy HTML. Notably, the extraction preserves LaTeX, which typical text extraction pipelines often corrupt or drop.
Filtering: Further filters apply language identification, mathematical content classifiers, and KenLM perplexity models. This step keeps the dataset focused on high-quality English mathematical text.
Deduplication and Inspection: A threshold-based SimHash method removes near-duplicate content, and manual inspection then checks the quality and relevance of the retained data.
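
The first and last stages above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword pattern, the word 3-gram shingling, the 64-bit fingerprint size, the 0.9 similarity threshold, and all function names are assumptions chosen for illustration, and only the standard library is used.

```python
import hashlib
import re

BITS = 64  # fingerprint width (an assumption, not taken from the paper)

# Hypothetical hints of mathematical markup; the real prefilter patterns may differ.
MATH_HINTS = re.compile(r"\\frac|\\begin\{equation\}|\$\$|<math|mathjax|katex", re.IGNORECASE)


def looks_mathematical(html: str) -> bool:
    """Cheap, high-recall prefilter: keep any page containing a math hint."""
    return MATH_HINTS.search(html) is not None


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over word 3-gram shingles."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * BITS
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for b in range(BITS):
            # Each shingle votes +1 or -1 on every bit position.
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(BITS):
        if counts[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def similarity(a: int, b: int) -> float:
    """Fraction of fingerprint bits on which two documents agree."""
    return 1.0 - bin(a ^ b).count("1") / BITS


def deduplicate(docs, threshold=0.9):
    """Keep a document only if no previously kept document is too similar."""
    kept, fingerprints = [], []
    for doc in docs:
        fp = simhash(doc)
        if all(similarity(fp, seen) < threshold for seen in fingerprints):
            kept.append(doc)
            fingerprints.append(fp)
    return kept
```

In this sketch, looks_mathematical would run on raw HTML before any expensive parsing, and deduplicate would run on the extracted text. The pairwise comparison in deduplicate is quadratic in the number of kept documents, so it only suits small collections; a corpus-scale pipeline would need an indexed lookup over fingerprints instead.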

Dataset Analysis and Benchmarking

OpenWebMath is on par with, and may exceed, some of the largest collections of mathematics-focused tokens, and it differs from its predecessors in its approach to filtering, deduplication, and extraction that preserves mathematical notation (e.g., LaTeX delimiters). The dataset draws on diverse sources, ranging from forums to formal educational and reference content. Its coverage also extends beyond mathematics to physics, computer science, and other technical areas.

The empirical evaluations support the dataset's quality. Models trained on OpenWebMath are more effective per training token on mathematical reasoning tasks than models trained on much larger general-domain datasets such as The Pile. This gives quantitative support for the claim that domain-specific data can advance the reasoning abilities of LLMs in specialized applications, a useful insight for ongoing AI research.
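
To make the per-token comparison concrete, the abstract's figures (14.7B OpenWebMath tokens versus "over 20x" as much general language data) imply a lower bound on the baseline's training budget. The symbols below are labels introduced here for illustration, and the result is simple arithmetic on the stated numbers rather than a figure reported in this summary:

```latex
N_{\text{general}} > 20 \times N_{\text{OpenWebMath}}
                   = 20 \times 14.7 \times 10^{9}
                   = 2.94 \times 10^{11}\ \text{tokens}
```

In other words, the general-domain baselines saw more than roughly 294 billion tokens.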

Implications and Future Directions

The introduction of OpenWebMath reflects a growing appreciation of domain-specific data for improving the reasoning capabilities of LLMs. Its implications are twofold: practically, it provides a resource for improving quantitative reasoning in language models; theoretically, it motivates further study of data curation strategies that balance volume against specificity and quality.

Moreover, OpenWebMath's extraction and filtering methods could be integrated into broader training-data pipelines, which opens up further variations in data preparation. Future work could explore optimizations and extensions to non-English and multimodal (text-plus-visual) data, building on the pipeline's structure to support more languages.

In conclusion, OpenWebMath narrows the gap between proprietary datasets and open-source research, and it demonstrates careful attention to data quality in curation. As applications increasingly call for intricate mathematical reasoning, OpenWebMath offers an important building block for that work.
