A Survey on Data Selection for Language Models

Published Feb 26, 2024 in cs.CL and cs.LG


A major factor in the recent success of LLMs is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

Overview of a data pipeline detailing steps from raw data to training language models.


  • Data selection is critical in training language models, especially LLMs, necessitating strategies for managing vast, diverse datasets to enhance model accuracy, efficiency, and fairness.

  • The paper introduces a taxonomy of data selection methods focused on distribution matching for domain-specific precision and distribution diversification for general applicability and robustness.

  • Pretraining LLMs involves filtering extensive datasets (like Common Crawl) to eliminate low-quality content, using heuristic and sophisticated model-based methods to preserve high-quality data.

  • Future advancements in data selection are tied to developing direct data evaluation metrics, comprehensive benchmarks, and strategies for balancing memorization and generalization.

Comprehensive Review on Data Selection Methods for Language Models

Introduction to Data Selection in Machine Learning

Data selection is a pivotal part of the machine learning pipeline, particularly in the age of LLMs, which are trained on massive, heterogeneous corpora. Selecting the right data for training these models is not straightforward: it involves identifying which subsets of data will lead to the best model performance in terms of accuracy, efficiency, and fairness. The challenge lies not only in handling the sheer volume of available data but also in coping with the wide variation in its quality.

Taxonomy of Data Selection Methods

Data selection practices can be broadly classified by two primary goals: matching the distribution of the training data to the target task (distribution matching) and broadening the coverage and diversity of the dataset (distribution diversification). Both have their applications: the former is crucial for domain-specific tasks that require high precision, and the latter for general-purpose models that need robustness and broad applicability.
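The two goals can be contrasted with a small sketch. It is only an illustration under stated assumptions, not a method taken from the survey: token-set Jaccard overlap stands in for whatever similarity measure a real system would use, and the function names (`match_target`, `diversify`) and the example documents are hypothetical.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (assumed stand-in for a real tokenizer)."""
    return set(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between two token sets; 0.0 when both are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_target(candidates: List[str], target: List[str], k: int) -> List[str]:
    """Distribution matching: keep the k candidates closest to the target sample."""
    def score(c: str) -> float:
        return max(similarity(c, t) for t in target)
    return sorted(candidates, key=score, reverse=True)[:k]


def diversify(candidates: List[str], k: int) -> List[str]:
    """Distribution diversification: greedily add the candidate that is
    least similar to anything already selected."""
    if not candidates or k <= 0:
        return []
    selected = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        nxt = min(remaining, key=lambda c: max(similarity(c, s) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


candidates = [
    "stock prices rose today",
    "stock prices fell today",
    "the recipe needs two eggs",
    "stock markets rose sharply",
]
matched = match_target(candidates, target=["stock prices rose"], k=2)
diverse = diversify(candidates, k=2)
```

With these toy inputs, `matched` stays within the finance-like documents, while `diverse` pairs the first document with the unrelated recipe sentence.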

The process of data selection comprises several strategic components, notably:

  • Utility Function Definition: Maps each data point to a numeric value representing its utility; this score is the basis for filtering and prioritizing data.
  • Selection Mechanism: Decides which data points are included in the training set based on their assigned utility values.
  • Dataset Characteristics Adjustment: Alters the dataset's distribution to favor characteristics deemed desirable for the training objectives.
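The first two components above can be sketched as a utility function plus a selection rule. This is a minimal illustration with assumed names: `length_utility` is a deliberately simplistic, hypothetical utility (token count), and the threshold and top-k rules are two common ways such scores might be turned into a selection.

```python
from typing import Callable, List


def length_utility(text: str) -> float:
    """Hypothetical utility: number of whitespace-separated tokens."""
    return float(len(text.split()))


def select_by_threshold(
    data: List[str], utility: Callable[[str], float], threshold: float
) -> List[str]:
    """Keep every data point whose utility is at least `threshold`."""
    return [x for x in data if utility(x) >= threshold]


def select_top_k(
    data: List[str], utility: Callable[[str], float], k: int
) -> List[str]:
    """Keep the k data points with the highest utility."""
    return sorted(data, key=utility, reverse=True)[:k]


docs = ["short", "a somewhat longer document here", "medium length text"]
kept = select_by_threshold(docs, length_utility, threshold=3.0)
top = select_top_k(docs, length_utility, k=1)
```

Separating the utility function from the selection mechanism means either can be swapped independently, for example replacing the token count with a classifier score while keeping the same threshold rule.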

Pretraining Data Selection

For pretraining LLMs, the goal is often to filter and curate data from very large corpora such as Common Crawl, removing low-quality or irrelevant text while retaining high-quality content. Heuristic rules are commonly used for this purpose, alongside more sophisticated model-based and perplexity-based quality filters. The challenge is to strike a balance that favors data efficiency and model performance without introducing significant biases.
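As a rough sketch of how heuristic and perplexity-based filtering might fit together, the example below applies a few rule-based checks and then ranks the surviving documents by perplexity under a tiny add-one-smoothed unigram model. Everything here is an assumption made for illustration: the thresholds, the function names, and the unigram model, which stands in for the much larger language models a real pipeline would use.

```python
import math
from collections import Counter
from typing import Iterable


def passes_heuristics(
    text: str,
    min_words: int = 5,
    max_mean_word_len: float = 12.0,
    min_alpha_frac: float = 0.7,
) -> bool:
    """Rule-based checks; all thresholds are illustrative assumptions."""
    words = text.split()
    if len(words) < min_words:
        return False
    if sum(len(w) for w in words) / len(words) > max_mean_word_len:
        return False
    non_space = [c for c in text if not c.isspace()]
    alpha_frac = sum(c.isalpha() for c in non_space) / len(non_space)
    return alpha_frac >= min_alpha_frac


def train_unigram(reference_docs: Iterable[str]) -> Counter:
    """Count lowercased tokens in a trusted reference corpus."""
    return Counter(w.lower() for d in reference_docs for w in d.split())


def unigram_perplexity(text: str, counts: Counter) -> float:
    """Perplexity under an add-one-smoothed unigram model (lower = more
    similar to the reference corpus)."""
    words = text.lower().split()
    if not words:
        return float("inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))


reference = ["the cat sat on the mat", "the dog sat on the rug"]
counts = train_unigram(reference)

raw = [
    "The cat sat on the mat next to the dog",
    "$$$ 1234 #### 5678 !!!! 0000 ----",
    "zzz qqq xxx vvv www yyy",
]
clean = [d for d in raw if passes_heuristics(d)]
ranked = sorted(clean, key=lambda d: unigram_perplexity(d, counts))
```

In this toy run the symbol-heavy line is rejected by the heuristics, and the nonsense-token line survives them but ranks last by perplexity, which shows why the two kinds of filter are often combined.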

Enhancing Language Model Performance through Specific Data Selection Techniques

  • Fine-tuning and Multitask Learning: These methods leverage auxiliary datasets or diverse tasks to improve model performance on a specific target or across many tasks. The emphasis is on domain-specific selection, where additional data is chosen to closely mirror the task at hand.
  • In-Context Learning: These techniques select or generate effective demonstrations to place in the prompt. They show that careful data selection can significantly influence model behavior even when the model is not trained on that data.
  • Task-specific Fine-tuning: Task-specific settings call for strategies that either increase the training data's alignment with the target task or improve data efficiency and robustness by carefully curating and diversifying the training samples.
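The demonstration selection described under In-Context Learning can be sketched as a nearest-neighbour lookup over a pool of labeled examples. This is a hedged illustration, not the survey's method: token-overlap (Jaccard) similarity is an assumed stand-in for the embedding-based retrieval a real system would more likely use, and `select_demonstrations`, `build_prompt`, and the example pool are hypothetical.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings; 0.0 when both are empty."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_demonstrations(query: str, pool: List[Example], k: int) -> List[Example]:
    """Pick the k pool examples whose inputs are most similar to the query."""
    return sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]


def build_prompt(query: str, demos: List[Example]) -> str:
    """Format the chosen demonstrations followed by the unanswered query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)


pool = [
    ("translate cat to French", "chat"),
    ("what is 2 plus 2", "4"),
    ("translate dog to French", "chien"),
]
query = "translate bird to French"
demos = select_demonstrations(query, pool, k=2)
prompt = build_prompt(query, demos)
```

Here the two translation examples are selected and the arithmetic example is left out, so the prompt contains only demonstrations that resemble the query.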

Future Directions and Challenges

The review underlines the nuanced trade-offs between memorization and generalization inherent in data selection decisions. It highlights three key future directions: metrics that evaluate data directly, comprehensive benchmarks, and a shift towards more holistic data processing strategies.


This survey aims to provide a structured understanding of the landscape of data selection methods in machine learning, with a focus on LLMs. It emphasizes the intricate balance required in selecting data that both aligns with target tasks and ensures models are robust, fair, and efficient. As the field evolves, so too will the strategies for selecting the optimal datasets, underscoring the importance of continued research and innovation in this space.

