
LESS: Selecting Influential Data for Targeted Instruction Tuning

(2402.04333)
Published Feb 6, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Instruction tuning has unlocked powerful capabilities in LLMs, effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.

Figure: Performance comparison of the LESS algorithm with varying dimensions on three datasets, using 5% of the data for tuning.

Overview

  • LESS is a novel algorithm designed for targeted instruction tuning in LLMs, enabling the selection of highly relevant data from large datasets to enhance specific skills without overwhelming computational costs.

  • It introduces optimizer-aware influence estimation compatible with the Adam optimizer and employs LoRA and random projection techniques to improve efficiency and transferability across different model sizes.

  • Experiments indicate that training LLMs on a subset of data chosen by LESS can outperform training on full datasets, highlighting its potential for creating more focused and resource-efficient training protocols.

  • LESS offers a promising avenue for future research in real-time model adaptation, efficiency optimization, and reducing unintended biases, demonstrating the impact of precise data selection on the efficacy of LLMs.

LESS: An Efficient Algorithm for Targeted Instruction Tuning in LLMs

Introduction to LESS

LLMs have gained significant traction for their ability to serve as general-purpose chatbots, capable of generating human-like text based on provided instructions. However, for real-world applications that demand specialized capabilities, such as advanced reasoning, the challenge of sifting through extensive instruction tuning datasets to identify and utilize the most relevant data becomes apparent. This process, termed "targeted instruction tuning," is crucial for developing specific skills within LLMs without having to train on the entire dataset, which may contain irrelevant or even counterproductive information.

The proposed solution to this challenge is the algorithm LESS (Low-rank gradiEnt Similarity Search), which represents a novel method for selecting influential data from large instruction tuning datasets. LESS operates by effectively estimating data influences using optimizer-aware formulations and performing a low-rank gradient similarity search to pinpoint the examples most pertinent to enhancing the model's performance on a given task.
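To make the scoring step concrete, the sketch below (illustrative, not the authors' released code) shows the general shape of the selection: each candidate example's low-dimensional gradient feature is compared, via cosine similarity, to the gradient features of the few-shot examples embodying the target capability, and the similarities are aggregated across warmup checkpoints with the learning rate as a weight. The function and argument names (`less_scores`, `candidate_feats`, `target_feats`) are placeholders introduced here, and the simple averaging over the target set is an assumption of this sketch.

```python
import numpy as np

def less_scores(candidate_feats, target_feats, learning_rates, eps=1e-12):
    """Illustrative LESS-style scoring (not the authors' implementation).

    candidate_feats: list over checkpoints of (N, d) arrays -- projected
        per-example gradient features for the N training candidates.
    target_feats: list over checkpoints of (M, d) arrays -- projected
        gradient features of M few-shot examples for the target task.
    learning_rates: per-checkpoint learning rates used to weight terms.
    Returns an (N,) array of influence scores; the top-scoring examples
    are the ones selected for targeted instruction tuning.
    """
    n = candidate_feats[0].shape[0]
    scores = np.zeros(n)
    for cand, tgt, lr in zip(candidate_feats, target_feats, learning_rates):
        # cosine similarity = inner product of unit-normalized features
        cand = cand / (np.linalg.norm(cand, axis=1, keepdims=True) + eps)
        tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + eps)
        # similarity of every candidate to every target example,
        # averaged over the target set and weighted by the learning rate
        scores += lr * (cand @ tgt.T).mean(axis=1)
    return scores

# Usage sketch: keep the top 5% of candidates by score
# top_idx = np.argsort(-scores)[: int(0.05 * len(scores))]
```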

LESS: The Underlying Mechanism

Compatibility with Instruction Tuning

At its core, LESS adapts existing influence estimation methods to work with the Adam optimizer and with variable-length instruction data. These adaptations matter because LLM fine-tuning is almost always done with Adam, whose per-parameter adaptive update direction differs from the raw gradient that classical influence formulations assume, and because instruction examples vary widely in length, which skews gradient norms if influence is measured with a plain inner product.
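As an illustration of the optimizer-aware piece, the sketch below computes a first-order Adam update direction for a single example from its gradient and the checkpoint's stored moment estimates; it is this direction, rather than the raw gradient, that gets projected and compared across examples. The function and argument names here are assumptions made for this sketch.

```python
import torch

def adam_update_direction(grad, exp_avg, exp_avg_sq, step,
                          beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative sketch of a per-example Adam update direction.

    grad: flattened per-example gradient at a training checkpoint.
    exp_avg, exp_avg_sq: first/second moment estimates stored by Adam
        at that checkpoint. step: the optimizer step count (int).
    Defaults mirror common Adam hyperparameters; all names are
    assumptions for this example.
    """
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad            # first moment
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad ** 2  # second moment
    m_hat = exp_avg / (1 - beta1 ** step)                      # bias correction
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    return m_hat / (v_hat.sqrt() + eps)                        # update direction
```

Comparing these directions with cosine similarity, rather than a raw inner product, is what keeps long sequences, which tend to produce larger gradient norms, from dominating the selection of variable-length instruction data.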

Efficiency Through LoRA and Random Projections

To address the computational and storage overhead associated with large model parameters, LESS employs LoRA (Low-Rank Adaptation) and random projection techniques to construct a gradient datastore. This datastore, consisting of low-dimensional gradient features, allows for efficient and effective dataset selection while being reusable for new target tasks, thus significantly reducing the computational cost.
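A minimal sketch of the projection step is shown below, assuming the LoRA gradients have already been flattened into a single vector. The dense Gaussian matrix here is purely illustrative and would be memory-heavy at scale (the paper relies on a memory-efficient random projection implementation); `project_gradient`, `proj_dim`, and `seed` are names introduced for this example.

```python
import torch

def project_gradient(grad_vec, proj_dim=8192, seed=0):
    """Illustrative random projection of a flattened LoRA gradient.

    A fixed random Gaussian matrix (a Johnson-Lindenstrauss-style
    projection) compresses the high-dimensional gradient into proj_dim
    features that approximately preserve inner products. The seed must be
    shared across all examples so every feature lives in the same space.
    """
    g = torch.Generator().manual_seed(seed)
    proj = torch.randn(grad_vec.numel(), proj_dim, generator=g) / proj_dim ** 0.5
    return grad_vec @ proj  # (proj_dim,) feature stored in the datastore
```

Because the projection is fixed by a shared seed, features computed once for the training pool remain comparable with features computed later for new target tasks, which is what makes the datastore reusable rather than something that must be rebuilt per task.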

Transferable Knowledge Across Models

A significant advantage of LESS is its ability to select data using gradients from smaller models to induce strong performance in larger models or even different model families. This transferability is crucial for practical applications where computational resources may be limited.

Interpretable Data Selection

LESS diverges from traditional methods that often rely on surface form cues for data selection. Instead, it focuses on identifying data that showcases similar reasoning and skill types required for the target task. This approach ensures that the selected data aligns more closely with the specific capabilities being targeted, rather than merely matching on language or topic.

Experimental Findings and Implications

The effectiveness of LESS is demonstrated through experiments on diverse downstream tasks, where training on only a 5% subset of data selected by LESS often outperforms training on the full dataset. This outcome underscores the potential for LESS to enable more focused and efficient training protocols, especially when the candidate dataset is far larger than the in-domain data a specialized task actually requires.

Additionally, the ability of LESS to select transferable data across models introduces a promising avenue for reducing the computational costs associated with data selection and model training. Smaller models can be utilized to curate training datasets for larger, more complex models, facilitating a more resource-efficient workflow without compromising performance.

The Road Ahead

While LESS presents a significant advance in targeted instruction tuning for LLMs, several avenues remain open for further exploration. These include extending LESS for real-time model adaptation, optimizing the algorithm for even greater efficiency, and investigating its potential for reducing unintended model biases by selectively focusing on data that promotes fairness and inclusivity.

In summary, LESS stands as a testament to the potential of intelligent data selection in unlocking more specialized and efficient capabilities within the realm of LLMs, paving the way for their broader application across a myriad of tasks demanding high degrees of specificity and complexity.
