Emergent Mind

Abstract

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.

Overview

  • The paper introduces 'Skill-It,' a novel framework hypothesizing that language models learn skills in a hierarchical order mirroring human skill acquisition.

  • It proposes an operational definition of a language model 'skill' and utilizes a directed skills graph to outline training priorities.

  • An online data selection algorithm, Skill-It, dynamically optimizes skill learning, surpassing traditional sampling methods in both continual pre-training and fine-tuning contexts.

  • The framework is validated by empirical evidence, showing marked improvements in LM capabilities when trained with the Skill-It approach on both synthetic and real datasets.

  • The application of Skill-It on the RedPajama dataset demonstrated higher performance and data efficiency, suggesting a potential new direction in LM training methodologies.

Understanding and Training Language Models with a Data-Driven Skills Framework

Abstract Overview

The paper presents a novel framework that hypothesizes language models (LMs) learn interdependent sets of skills from their training data in a certain order, much like human learning processes. This framework, coined Skill-It, is grounded in the idea that some skills serve as prerequisites for others, thus affecting the efficiency of learning subsequent skills. Empirical evidence supports the existence of ordered skill sets, demonstrating that training LMs with consideration of skill order and prerequisite relationships is beneficial.

Establishing a Skills Framework

At the core of this research is the operational definition of an LM skill—the ability to execute behavior with associated data—alongside the concept of ordered skill sets. These sets encompass skills that are related through a directed skills graph. The graph serves as a blueprint, showing which skills should be prioritized in training regimes to reduce the amount of data needed for learning subsequent skills. Through both synthetic and real datasets, the study validates ordered skill sets where some skills are better learned when trained with their prerequisite skills.
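The ordered-skill-set idea can be made concrete with a small sketch. Below, skills are nodes in a directed graph and an edge `(a, b)` means skill `a` acts as a prerequisite that makes skill `b` cheaper to learn; a topological sort then yields a training order in which every prerequisite precedes its dependents. The specific skill names and edges here are illustrative, not taken from the paper's learned graphs.

```python
from collections import defaultdict, deque

def topological_order(edges, skills):
    """Return a training order in which every prerequisite skill
    precedes the skills that depend on it (Kahn's algorithm)."""
    indegree = {s: 0 for s in skills}
    adj = defaultdict(list)
    for pre, post in edges:  # edge pre -> post: pre helps learn post
        adj[pre].append(post)
        indegree[post] += 1
    queue = deque(s for s in skills if indegree[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in adj[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    if len(order) != len(skills):
        raise ValueError("skills graph contains a cycle")
    return order

# Hypothetical ordered skill set: a cross-lingual QA skill benefits
# from both the target language and QA in a known language.
skills = ["english", "spanish", "english_qa", "spanish_qa"]
edges = [("english", "english_qa"),
         ("spanish", "spanish_qa"),
         ("english_qa", "spanish_qa")]
order = topological_order(edges, skills)
```

In this toy graph, `spanish_qa` is never scheduled before its two prerequisite skills, mirroring the paper's claim that advanced skills need less data when their prerequisites are trained first.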

Algorithm for Online Data Selection

The proposed framework guides the development of the Skill-It online data selection algorithm, which strategically samples from an LM's training data to optimize skill acquisition. A simpler baseline, skill-stratified sampling, also exploits the ordered skill set by sampling uniformly over the relevant skills and their prerequisites, but its fixed mixture cannot adapt as some skills are mastered earlier than others. In contrast, Skill-It adjusts the importance of skills as training progresses, prioritizing unmastered or influential prerequisite skills through an online optimization problem. This adaptive approach helps in both continual pre-training and fine-tuning contexts, as evidenced by results outperforming random sampling and curriculum learning baselines on synthetic tasks.
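The dynamic reweighting described above can be sketched as a multiplicative-weights-style update over the skill mixture: each skill's sampling weight is scaled by the exponentiated, graph-propagated validation losses, so skills that are still unlearned, or that are prerequisites of unlearned skills, get sampled more. This is a minimal illustration of the idea, not the paper's exact update rule; the adjacency matrix, learning rate, and loss values are all assumptions.

```python
import math

def update_mixture(weights, losses, adjacency, eta=0.5):
    """One exponentiated-gradient-style step: skills whose validation
    loss is still high, or that are prerequisites (via `adjacency`) of
    high-loss skills, receive larger sampling weight."""
    k = len(weights)
    # Propagate each skill's loss back to its prerequisites:
    # signal[i] aggregates losses of skills that i influences.
    signal = [sum(adjacency[i][j] * losses[j] for j in range(k))
              for i in range(k)]
    unnorm = [weights[i] * math.exp(eta * signal[i]) for i in range(k)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Three skills; skill 2 depends on skills 0 and 1 (hypothetical graph,
# with edge weights 0.5 chosen arbitrarily for illustration).
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.0, 0.0, 1.0]]
w = [1 / 3, 1 / 3, 1 / 3]
losses = [0.1, 0.1, 2.0]  # skill 2 is not yet learned
w = update_mixture(w, losses, A)
```

After one step, the mixture shifts toward skill 2 and, through the graph, keeps nonzero mass on its two prerequisites rather than abandoning them, which is the qualitative behavior the framework calls for.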

Evaluation and Application

The effectiveness of the Skill-It algorithm is notable on both synthetic and real datasets: 36.5 points higher accuracy than random sampling on the LEGO synthetic in continual pre-training, and a 13.6% reduction in target-skill validation loss on Natural Instructions in fine-tuning, versus training only on the target skill's own data. An application of the skills framework and Skill-It to the sizable RedPajama dataset yielded remarkable efficiency: a 3B-parameter LM continually pre-trained on 1B tokens outperformed the baseline uniform sampling strategy trained on 3B tokens, as measured on the LM Evaluation Harness.

Conclusion

The introduction of the Skill-It framework marks a considerable advance in LM training. This skills-based approach not only deepens the understanding of how LMs learn from data but also delivers a method for more effective and data-efficient training. Through empirical demonstrations on diverse datasets, the researchers invite further exploration into the alignment of skills with data, potentially opening a new phase of focused and efficient LM development.
