QuRating: Selecting High-Quality Data for Training Language Models

(2402.09739)
Published Feb 15, 2024 in cs.CL and cs.LG

Abstract

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value. We find that LLMs are able to discern these qualities and observe that they are better at making pairwise judgments of texts than at rating the quality of a text directly. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity, as selecting only the highest-rated documents leads to poor results. When we sample using quality ratings as logits over documents, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

LLM comparative judgments are used to train the QuRater model, which assigns quality ratings to language-model training documents.

Overview

  • QuRating introduces a method for selecting high-quality training data for language models: pairwise comparative judgments from GPT-3.5-turbo are converted into scalar quality ratings using the Bradley-Terry model.

  • Experimental findings show that models trained on QuRating-selected data achieve lower perplexity and better in-context learning than models trained on uniformly sampled data; the results also highlight the importance of balancing quality and diversity in the training data.

  • The study's analysis suggests that pairwise comparisons provide more stable and discriminative judgments than direct ratings; future directions include integration with domain optimization, mitigation of biases, and exploration of additional criteria for data-quality assessment.

QuRating: Selecting High-Quality Data for Training Language Models

In the current landscape of NLP, the importance of high-quality pre-training data for the creation of advanced Language Models (LMs) cannot be overstated. The paper "QuRating: Selecting High-Quality Data for Training Language Models" introduces an innovative approach to data selection, termed QuRating, which aims to capture abstract textual qualities that are intuitively perceived by humans. This method is particularly relevant given the growing size of training corpora and the need for more precise methods of data curation to optimize model performance.

Methodology

QuRating relies on several steps to assess and incorporate data quality:

  1. Pairwise Comparative Judgments: The method begins by comparing pairs of texts based on certain quality criteria. These comparisons are performed by a state-of-the-art LLM, in this case, GPT-3.5-turbo, which gauges which text in a pair better exemplifies the specified quality.
  2. Training the QuRater Model: The Bradley-Terry model is used to interpret the pairwise judgments probabilistically, and a dedicated QuRater model is fine-tuned on these comparisons to predict scalar quality ratings (see the sketch after this list).
  3. Data Annotation: This QuRater model is used to annotate a large corpus, in this instance, a 260B-token subset of the SlimPajama dataset, with quality ratings across four specified criteria.
  4. Data Selection and Training: Using these quality ratings, the authors sample 30B tokens and train 1.3B-parameter LMs. Various strategies for data selection, including top-$k$ selection and sampling based on quality scores with different temperatures, were explored.
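
The Bradley-Terry model treats each pairwise judgment as a noisy observation of the difference between two scalar ratings: the probability that text A is preferred over text B is sigmoid(r_A - r_B). Below is a minimal PyTorch sketch of this training objective, assuming the ratings come from a scalar rating head on top of a text encoder; the variable names and toy batch are illustrative placeholders, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_preferred: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    The probability that the preferred text outranks the other is
    sigmoid(score_preferred - score_other); minimising the NLL trains
    a scalar rating head to reproduce the pairwise judgments.
    """
    return -F.logsigmoid(score_preferred - score_other).mean()

# Toy usage. In practice the scores would come from a rating head on top of
# a fine-tuned text encoder, e.g. ratings = rating_head(encoder(texts)).
scores_a = torch.tensor([0.8, -0.2, 1.5], requires_grad=True)  # texts judged better
scores_b = torch.tensor([0.1, 0.3, 0.9], requires_grad=True)   # texts judged worse
loss = bradley_terry_loss(scores_a, scores_b)
loss.backward()
print(float(loss))
```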

The qualities assessed in this paper are:

  • Writing Style: Emphasis on polished and beautiful prose.
  • Facts & Trivia: Density of specific and obscure facts.
  • Educational Value: Presence of clear explanations and instructional content.
  • Required Expertise: Level of prior knowledge needed to comprehend the text.

Experimental Findings

The results from training models on subsets selected by QuRating reveal several insights:

  1. Perplexity and In-Context Learning: The models trained using QuRating demonstrated lower perplexity and improved in-context learning (ICL) performance compared to those trained on uniformly sampled data, especially when certain quality criteria, like educational value and facts & trivia, were used.
  2. Importance of Balance: Selecting only the highest-rated documents via top-$k$ selection was found to be less effective, highlighting the importance of balancing quality and diversity in the training data (see the sampling sketch after this list).
  3. Criterion-specific Performance: Different criteria led to varying improvements; educational value showed the most consistent gains in ICL, while writing style, although resulting in the lowest perplexity, did not significantly improve ICL performance.
  4. Curriculum Learning: Ordering the training data by increasing required expertise over the course of training also showed promising improvements, without changing the underlying training set.
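
Item 2 corresponds to the paper's temperature-based sampling, in which quality ratings act as logits over documents. The sketch below shows one plausible way to implement this; the temperature value, token-budget loop, and function name are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

def sample_by_rating(ratings: np.ndarray, doc_lengths: np.ndarray,
                     token_budget: int, temperature: float = 2.0,
                     seed: int = 0) -> np.ndarray:
    """Sample document indices with probability softmax(rating / temperature).

    A higher temperature flattens the distribution (more diversity);
    a lower temperature approaches greedy top-k selection by rating.
    """
    rng = np.random.default_rng(seed)
    logits = ratings / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Draw documents without replacement, then keep them until the
    # token budget is filled.
    order = rng.choice(len(ratings), size=len(ratings), replace=False, p=probs)
    selected, total = [], 0
    for idx in order:
        selected.append(idx)
        total += int(doc_lengths[idx])
        if total >= token_budget:
            break
    return np.array(selected)

# Example: pick roughly 1,000 tokens' worth of documents from 5 candidates.
ratings = np.array([2.1, -0.5, 0.3, 1.7, 0.0])
lengths = np.array([400, 350, 300, 500, 250])
print(sample_by_rating(ratings, lengths, token_budget=1000))
```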

Analysis and Implications

In their analysis, the authors note that using pairwise comparisons instead of direct rating provided more stable and discriminative judgments from the LLM, reducing biases and inconsistencies. This model of comparative judgment is rooted in psychological research and aligns well with educational assessment methodologies that favor comparative over absolute scoring.
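
As a concrete illustration of the pairwise protocol, the following sketch shows how such a comparison could be posed to GPT-3.5-turbo via the OpenAI chat API. The prompt wording is a hypothetical stand-in rather than the paper's actual prompt, and the educational-value criterion is used only as an example.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical prompt template -- not the paper's exact wording.
PROMPT = (
    "Which of the following two excerpts has more educational value "
    "(clear explanations, instructional content)?\n\n"
    "Excerpt A:\n{a}\n\nExcerpt B:\n{b}\n\n"
    "Answer with exactly one letter: A or B."
)

def judge_pair(text_a: str, text_b: str) -> str:
    """Ask the judge model which excerpt better exemplifies the criterion."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(a=text_a, b=text_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()  # "A" or "B"
```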

The extensive annotation and quality rating of a colossal 260B-token corpus allowed for an in-depth analysis of the text quality distribution across various domains. The study found that while certain domains like Books scored higher in writing style, other domains like ArXiv scored higher in required expertise, reinforcing the notion of domain diversity in high-quality data.

Future Directions

The paper concludes with a discussion on the implications of QuRating and potential future directions. The results suggest that fine-grained quality assessment can significantly impact the efficiency and performance of LLM training, allowing for the creation of more capable models under resource constraints.

Potential future directions include:

  • Integration with Domain Optimization: Combining QuRating with methods that optimize the domain mixture of the training data could yield further improvements.
  • Mitigating Biases: Further research is needed to address and mitigate potential social, cultural, and linguistic biases inherent in data selection processes.
  • Exploration of Additional Criteria: Extending the set of qualitative criteria and refining the prompts for existing ones can help capture an even broader range of textual qualities.

Overall, this paper presents a systematic and empirically validated approach to enhancing language model training through thoughtful data curation, underlining the fundamental role of data quality in NLP research and applications.
