Emergent Mind

Abstract

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.

Overview

  • The paper introduces 'Skill-It,' a novel framework hypothesizing that language models learn skills in a hierarchical order mirroring human skill acquisition.

  • It proposes an operational definition of a language model 'skill' and utilizes a directed skills graph to outline training priorities.

  • An online data selection algorithm, Skill-It, dynamically optimizes skill learning, surpassing traditional sampling methods in both continual pre-training and fine-tuning contexts.

  • The framework is validated by empirical evidence, showing marked improvements in LM capabilities when trained with the Skill-It approach on both synthetic and real datasets.

  • The application of Skill-It on the RedPajama dataset demonstrated higher performance and data efficiency, suggesting a potential new direction in LM training methodologies.

Understanding and Training Language Models with a Data-Driven Skills Framework

Abstract Overview

The paper presents a novel framework that hypothesizes language models (LMs) learn interdependent sets of skills from their training data in a certain order, much like human learning processes. This framework, coined Skill-It, is grounded in the idea that some skills serve as prerequisites for others, thus affecting the efficiency of learning subsequent skills. Empirical evidence supports the existence of ordered skill sets, demonstrating that training LMs with consideration of skill order and prerequisite relationships is beneficial.

Establishing a Skills Framework

At the core of this research is the operational definition of an LM skill—the ability to execute behavior with associated data—alongside the concept of ordered skill sets. These sets encompass skills that are related through a directed skills graph. The graph serves as a blueprint, showing which skills should be prioritized in training regimes to reduce the amount of data needed for learning subsequent skills. Through both synthetic and real datasets, the study validates ordered skill sets where some skills are better learned when trained with their prerequisite skills.
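The ordered-skill-set idea can be made concrete with a small sketch. Below, skills are nodes in a directed graph and an edge `(a, b)` means skill `a` acts as a prerequisite that makes skill `b` cheaper to learn; a topological sort then yields a training order in which every prerequisite precedes its dependents. The specific skill names and edges here are illustrative, not taken from the paper's learned graphs.

```python
from collections import defaultdict, deque

def topological_order(edges, skills):
    """Return a training order in which every prerequisite skill
    precedes the skills that depend on it (Kahn's algorithm)."""
    indegree = {s: 0 for s in skills}
    adj = defaultdict(list)
    for pre, post in edges:  # edge pre -> post: pre helps learn post
        adj[pre].append(post)
        indegree[post] += 1
    queue = deque(s for s in skills if indegree[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in adj[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    if len(order) != len(skills):
        raise ValueError("skills graph contains a cycle")
    return order

# Hypothetical ordered skill set: a cross-lingual QA skill benefits
# from both the target language and QA in a known language.
skills = ["english", "spanish", "english_qa", "spanish_qa"]
edges = [("english", "english_qa"),
         ("spanish", "spanish_qa"),
         ("english_qa", "spanish_qa")]
order = topological_order(edges, skills)
```

In this toy graph, `spanish_qa` is never scheduled before its two prerequisite skills, mirroring the paper's claim that advanced skills need less data when their prerequisites are trained first.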

Algorithm for Online Data Selection

The proposed framework guides the development of the Skill-It online data selection algorithm, which strategically samples from an LM's training data to optimize skill acquisition. A simpler baseline, skill-stratified sampling, also exploits the ordered skill set by sampling uniformly over the relevant skills and their prerequisites, but its fixed mixture cannot adapt as some skills are mastered earlier than others. In contrast, Skill-It adjusts the importance of skills as training progresses, prioritizing unmastered or influential prerequisite skills through an online optimization problem. This adaptive approach helps in both continual pre-training and fine-tuning contexts, as evidenced by results outperforming random sampling and curriculum learning baselines on synthetic tasks.
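The dynamic reweighting described above can be sketched as a multiplicative-weights-style update over the skill mixture: each skill's sampling weight is scaled by the exponentiated, graph-propagated validation losses, so skills that are still unlearned, or that are prerequisites of unlearned skills, get sampled more. This is a minimal illustration of the idea, not the paper's exact update rule; the adjacency matrix, learning rate, and loss values are all assumptions.

```python
import math

def update_mixture(weights, losses, adjacency, eta=0.5):
    """One exponentiated-gradient-style step: skills whose validation
    loss is still high, or that are prerequisites (via `adjacency`) of
    high-loss skills, receive larger sampling weight."""
    k = len(weights)
    # Propagate each skill's loss back to its prerequisites:
    # signal[i] aggregates losses of skills that i influences.
    signal = [sum(adjacency[i][j] * losses[j] for j in range(k))
              for i in range(k)]
    unnorm = [weights[i] * math.exp(eta * signal[i]) for i in range(k)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Three skills; skill 2 depends on skills 0 and 1 (hypothetical graph,
# with edge weights 0.5 chosen arbitrarily for illustration).
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.0, 0.0, 1.0]]
w = [1 / 3, 1 / 3, 1 / 3]
losses = [0.1, 0.1, 2.0]  # skill 2 is not yet learned
w = update_mixture(w, losses, A)
```

After one step, the mixture shifts toward skill 2 and, through the graph, keeps nonzero mass on its two prerequisites rather than abandoning them, which is the qualitative behavior the framework calls for.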

Evaluation and Application

The effectiveness of the Skill-It algorithm is notable on both synthetic and real datasets: 36.5 points higher accuracy than random sampling on the LEGO synthetic in continual pre-training, and a 13.6% reduction in target-skill validation loss on Natural Instructions in fine-tuning, versus training only on the target skill's own data. An application of the skills framework and Skill-It to the sizable RedPajama dataset yielded remarkable efficiency: a 3B-parameter LM continually pre-trained on 1B tokens outperformed the baseline uniform sampling strategy trained on 3B tokens, as measured on the LM Evaluation Harness.

Conclusion

The introduction of the Skill-It framework marks a considerable advance in LM training. This skills-based approach not only deepens the understanding of how LMs learn from data but also delivers a method for more effective and data-efficient training. Through empirical demonstrations on diverse datasets, the researchers invite further exploration into the alignment of skills with data, potentially opening a new phase of focused and efficient LM development.
