MoDS: Model-oriented Data Selection for Instruction Tuning (2311.15653v1)
Abstract: Instruction tuning has become the de facto method for equipping LLMs with the ability to follow user instructions. Usually, hundreds of thousands or even millions of instruction-following pairs are used to fine-tune foundation LLMs. Recently, some studies have shown that a small amount of high-quality instruction data can be sufficient. However, how to select appropriate instruction data for a given LLM remains an open problem. To address this problem, we present a model-oriented data selection (MoDS) approach, which selects instruction data based on new criteria covering three aspects: quality, coverage and necessity. First, our approach uses a quality evaluation model to filter a high-quality subset out of the original instruction dataset, and then applies an algorithm to further select from this subset a seed instruction dataset with good coverage. The seed dataset is used to fine-tune the foundation LLM, yielding an initial instruction-following LLM. Finally, we develop a necessity evaluation model to identify the instruction data on which this initial model performs badly, and treat these as necessary instructions for further improving the LLM. In this way, we obtain a small high-quality, broad-coverage and high-necessity subset of the original instruction dataset. Experimental results show that a model fine-tuned with 4,000 instruction pairs selected by our approach outperforms a model fine-tuned with the full original dataset of 214k instruction pairs.
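
The three-stage selection described in the abstract can be sketched compactly. The snippet below is a minimal illustration, not the authors' implementation: it assumes precomputed instruction embeddings, numeric `quality_score` and `necessity_score` functions standing in for the paper's quality and necessity evaluation models, hypothetical threshold parameters, and a farthest-point (k-center greedy style) heuristic for the coverage step; fine-tuning the seed model between stages two and three is only indicated by a comment.

```python
import numpy as np

def k_center_greedy(embeddings, k):
    """Farthest-point sampling: greedily pick points that spread over the
    embedding space, as a stand-in for the coverage-oriented seed selection."""
    n = embeddings.shape[0]
    selected = [0]  # start from an arbitrary point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(1, min(k, n)):
        idx = int(np.argmax(dists))          # point farthest from current seeds
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return selected

def mods_select(pairs, embeddings, quality_score, necessity_score,
                quality_threshold=0.5, seed_size=1000, necessity_threshold=0.5):
    """Sketch of quality filter -> coverage seed -> necessity augmentation.
    Scorers and thresholds are illustrative placeholders, not the paper's values."""
    # 1) keep only the high-quality instruction pairs
    hq_idx = [i for i, p in enumerate(pairs) if quality_score(p) >= quality_threshold]
    # 2) pick a broad-coverage seed subset from the high-quality pool
    seed_local = k_center_greedy(embeddings[hq_idx], seed_size)
    seed_idx = [hq_idx[i] for i in seed_local]
    # (fine-tune the foundation LLM on the seed subset here, then score
    #  necessity from the resulting model's responses)
    # 3) add high-quality pairs the seed-tuned model still handles badly
    seed_set = set(seed_idx)
    augment_idx = [i for i in hq_idx
                   if i not in seed_set and necessity_score(pairs[i]) >= necessity_threshold]
    return seed_idx + augment_idx
```

The thresholds and seed size here are purely illustrative knobs; in practice they would be tuned to the target budget of selected instruction pairs (e.g. the 4,000-pair subset reported in the abstract).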