Instruction Mining: When Data Mining Meets Large Language Model Finetuning

(2307.06290)
Published Jul 12, 2023 in cs.CL, cs.AI, and cs.LG

Abstract

LLMs are initially pretrained for broad capabilities and then finetuned with instruction-following datasets to improve their performance in interacting with humans. Despite advances in finetuning, a standardized guideline for selecting high-quality datasets to optimize this process remains elusive. In this paper, we first propose InstructMining, an innovative method designed for automatically selecting premium instruction-following data for finetuning LLMs. Specifically, InstructMining utilizes natural language indicators as a measure of data quality, applying them to evaluate unseen datasets. During experimentation, we discover that the double descent phenomenon exists in large language model finetuning. Based on this observation, we further leverage BlendSearch to find the best subset of the entire dataset (i.e., 2,532 examples out of 100,000). Experiment results show that InstructMining-7B achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and the Hugging Face Open LLM leaderboard.
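The abstract only outlines the pipeline, so the following is a minimal, hypothetical Python sketch of the two ideas it mentions: scoring examples with natural-language quality indicators, and searching for the finetuning subset size. All names (e.g., `quality_score`, `select_top_k`), indicator choices, and weights are illustrative assumptions rather than the paper's implementation, and a simple candidate-size sweep stands in for BlendSearch.

```python
# Hypothetical sketch of InstructMining-style data selection.
# Indicator functions, weights, and helper names are illustrative only.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    instruction: str
    response: str


def length_indicator(ex: Example) -> float:
    # One cheap natural-language indicator: response length in tokens (words here).
    return float(len(ex.response.split()))


def reward_indicator(ex: Example) -> float:
    # Placeholder for a learned indicator, e.g. a reward-model score.
    return 0.0


# Linear combination of indicators; in the paper the relationship between
# indicators and finetuned-model loss is estimated empirically.
INDICATORS: List[Callable[[Example], float]] = [length_indicator, reward_indicator]
WEIGHTS: List[float] = [0.01, 1.0]  # illustrative values only


def quality_score(ex: Example) -> float:
    return sum(w * f(ex) for w, f in zip(WEIGHTS, INDICATORS))


def select_top_k(dataset: List[Example], k: int) -> List[Example]:
    # Rank the unseen dataset by estimated quality and keep the best k examples.
    return sorted(dataset, key=quality_score, reverse=True)[:k]


def search_subset_size(dataset: List[Example],
                       finetune_and_eval: Callable[[List[Example]], float],
                       candidate_sizes: List[int]) -> int:
    # Stand-in for BlendSearch: finetune on a few candidate subset sizes and
    # keep the one with the lowest validation loss. Because of the double
    # descent the paper reports, the best size can be far smaller than the
    # full dataset (e.g., ~2.5k out of 100k).
    best_size, best_loss = candidate_sizes[0], float("inf")
    for k in candidate_sizes:
        loss = finetune_and_eval(select_top_k(dataset, k))
        if loss < best_loss:
            best_size, best_loss = k, loss
    return best_size
```

In practice the size search would be driven by an economical black-box optimizer such as BlendSearch rather than the exhaustive sweep shown here; the sketch only illustrates the selection-then-search structure described in the abstract.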
