QuRating: Selecting High-Quality Data for Training Language Models

(2402.09739)
Published Feb 15, 2024 in cs.CL and cs.LG

Abstract

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value. We find that LLMs are able to discern these qualities and observe that they are better at making pairwise judgments of texts than at rating the quality of a text directly. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity, as selecting only the highest-rated documents leads to poor results. When we sample using quality ratings as logits over documents, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

LLM comparative judgments are used to train the QuRater model, which assigns quality ratings to language-model training documents.

Overview

  • QuRating introduces a method for selecting high-quality training data for language models: pairwise comparative judgments from GPT-3.5-turbo are converted into scalar quality ratings using the Bradley-Terry model.

  • Experimental findings show that models trained on QuRating-selected data achieve lower perplexity and better in-context learning than models trained on uniformly sampled data; the results also highlight the importance of balancing quality and diversity in the training data.

  • The study's analysis suggests that pairwise comparisons provide more stable and discriminative judgments than direct ratings; future directions include integration with domain optimization, mitigation of biases, and exploration of additional criteria for data-quality assessment.

QuRating: Selecting High-Quality Data for Training Language Models

In the current landscape of NLP, the importance of high-quality pre-training data for the creation of advanced Language Models (LMs) cannot be overstated. The paper "QuRating: Selecting High-Quality Data for Training Language Models" introduces an innovative approach to data selection, termed QuRating, which aims to capture abstract textual qualities that are intuitively perceived by humans. This method is particularly relevant given the growing size of training corpora and the need for more precise methods of data curation to optimize model performance.

Methodology

QuRating relies on several steps to assess and incorporate data quality:

  1. Pairwise Comparative Judgments: The method begins by comparing pairs of texts based on certain quality criteria. These comparisons are performed by a state-of-the-art LLM, in this case, GPT-3.5-turbo, which gauges which text in a pair better exemplifies the specified quality.
  2. Training the QuRater Model: The Bradley-Terry model is used to interpret the pairwise judgments probabilistically, and a dedicated QuRater model is fine-tuned on these comparisons to predict scalar quality ratings (see the sketch after this list).
  3. Data Annotation: This QuRater model is used to annotate a large corpus, in this instance, a 260B-token subset of the SlimPajama dataset, with quality ratings across four specified criteria.
  4. Data Selection and Training: Using these quality ratings, the authors sample 30B tokens and train 1.3B-parameter LMs. Various strategies for data selection, including top-$k$ selection and sampling based on quality scores with different temperatures, were explored.
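
The Bradley-Terry model treats each pairwise judgment as a noisy observation of the difference between two scalar ratings: the probability that text A is preferred over text B is sigmoid(r_A - r_B). Below is a minimal PyTorch sketch of this training objective, assuming the ratings come from a scalar rating head on top of a text encoder; the variable names and toy batch are illustrative placeholders, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_preferred: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    The probability that the preferred text outranks the other is
    sigmoid(score_preferred - score_other); minimising the NLL trains
    a scalar rating head to reproduce the pairwise judgments.
    """
    return -F.logsigmoid(score_preferred - score_other).mean()

# Toy usage. In practice the scores would come from a rating head on top of
# a fine-tuned text encoder, e.g. ratings = rating_head(encoder(texts)).
scores_a = torch.tensor([0.8, -0.2, 1.5], requires_grad=True)  # texts judged better
scores_b = torch.tensor([0.1, 0.3, 0.9], requires_grad=True)   # texts judged worse
loss = bradley_terry_loss(scores_a, scores_b)
loss.backward()
print(float(loss))
```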

The qualities assessed in this paper are:

  • Writing Style: Emphasis on polished and beautiful prose.
  • Facts & Trivia: Density of specific and obscure facts.
  • Educational Value: Presence of clear explanations and instructional content.
  • Required Expertise: Level of prior knowledge needed to comprehend the text.

Experimental Findings

The results from training models on subsets selected by QuRating reveal several insights:

  1. Perplexity and In-Context Learning: The models trained using QuRating demonstrated lower perplexity and improved in-context learning (ICL) performance compared to those trained on uniformly sampled data, especially when certain quality criteria, like educational value and facts & trivia, were used.
  2. Importance of Balance: Selecting only the highest-rated documents via top-$k$ selection was found to be less effective, highlighting the importance of balancing quality and diversity in the training data (see the sampling sketch after this list).
  3. Criterion-specific Performance: Different criteria led to varying improvements; educational value showed the most consistent gains in ICL, while writing style, although resulting in the lowest perplexity, did not significantly improve ICL performance.
  4. Curriculum Learning: Ordering the training data by increasing required expertise over the course of training also showed promising improvements, without changing the underlying training set.
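
Item 2 corresponds to the paper's temperature-based sampling, in which quality ratings act as logits over documents. The sketch below shows one plausible way to implement this; the temperature value, token-budget loop, and function name are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

def sample_by_rating(ratings: np.ndarray, doc_lengths: np.ndarray,
                     token_budget: int, temperature: float = 2.0,
                     seed: int = 0) -> np.ndarray:
    """Sample document indices with probability softmax(rating / temperature).

    A higher temperature flattens the distribution (more diversity);
    a lower temperature approaches greedy top-k selection by rating.
    """
    rng = np.random.default_rng(seed)
    logits = ratings / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Draw documents without replacement, then keep them until the
    # token budget is filled.
    order = rng.choice(len(ratings), size=len(ratings), replace=False, p=probs)
    selected, total = [], 0
    for idx in order:
        selected.append(idx)
        total += int(doc_lengths[idx])
        if total >= token_budget:
            break
    return np.array(selected)

# Example: pick roughly 1,000 tokens' worth of documents from 5 candidates.
ratings = np.array([2.1, -0.5, 0.3, 1.7, 0.0])
lengths = np.array([400, 350, 300, 500, 250])
print(sample_by_rating(ratings, lengths, token_budget=1000))
```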

Analysis and Implications

In their analysis, the authors note that using pairwise comparisons instead of direct rating provided more stable and discriminative judgments from the LLM, reducing biases and inconsistencies. This model of comparative judgment is rooted in psychological research and aligns well with educational assessment methodologies that favor comparative over absolute scoring.
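
As a concrete illustration of the pairwise protocol, the following sketch shows how such a comparison could be posed to GPT-3.5-turbo via the OpenAI chat API. The prompt wording is a hypothetical stand-in rather than the paper's actual prompt, and the educational-value criterion is used only as an example.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical prompt template -- not the paper's exact wording.
PROMPT = (
    "Which of the following two excerpts has more educational value "
    "(clear explanations, instructional content)?\n\n"
    "Excerpt A:\n{a}\n\nExcerpt B:\n{b}\n\n"
    "Answer with exactly one letter: A or B."
)

def judge_pair(text_a: str, text_b: str) -> str:
    """Ask the judge model which excerpt better exemplifies the criterion."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(a=text_a, b=text_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()  # "A" or "B"
```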

The extensive annotation and quality rating of a colossal 260B-token corpus allowed for an in-depth analysis of the text quality distribution across various domains. The study found that while certain domains like Books scored higher in writing style, other domains like ArXiv scored higher in required expertise, reinforcing the notion of domain diversity in high-quality data.

Future Directions

The paper concludes with a discussion on the implications of QuRating and potential future directions. The results suggest that fine-grained quality assessment can significantly impact the efficiency and performance of LLM training, allowing for the creation of more capable models under resource constraints.

Potential future directions include:

  • Integration with Domain Optimization: Combining QuRating with methods that optimize the domain mixture of the training data could yield further improvements.
  • Mitigating Biases: Further research is needed to address and mitigate potential social, cultural, and linguistic biases inherent in data selection processes.
  • Exploration of Additional Criteria: Extending the set of qualitative criteria and refining the prompts for existing ones can help capture an even broader range of textual qualities.

Overall, this paper presents a systematic and empirically validated approach to enhancing language model training through thoughtful data curation, underlining the fundamental role of data quality in NLP research and applications.
