Automated Data Curation for Robust Language Model Fine-Tuning (2403.12776v1)
Abstract: LLMs have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses. Supervised fine-tuning specializes an LLM by training it on a dataset of example prompts with target responses, but real-world data tends to be noisy. While many fine-tuning algorithms exist, here we consider a *data-centric AI* perspective on LLM fine-tuning, studying how to *systematically* curate the training dataset to improve the LLM produced via *any* fine-tuning algorithm. We introduce an automated data curation pipeline CLEAR (Confidence-based LLM Evaluation And Rectification) for instruction tuning datasets, that can be used with any LLM and fine-tuning procedure. CLEAR estimates which training data is low-quality and either filters or corrects it. Automatically identifying which data to filter or correct is done via LLM-derived confidence estimates, to ensure only confident modifications to the dataset. Unlike existing data curation techniques, CLEAR is a comprehensive framework that can improve a dataset (and trained model outputs) without additional fine-tuning computations. We don't assume access to a stronger LLM than the model being fine-tuned (e.g. relying on GPT-4 when fine-tuning GPT-3.5), to see whether CLEAR can meaningfully improve the capabilities of any LLM. Experiments reveal that CLEAR consistently improves the performance of fine-tuned models across many datasets and models (like GPT-3.5 and Llama2).
Summary
- The paper introduces CLEAR, a pipeline that leverages LLM-derived confidence scores to filter and correct noisy instruction tuning data.
- CLEAR uses a two-stage process—Auto-Filter and Auto-Correct—with BSDetector to reliably evaluate and improve response quality.
- Empirical results on datasets like SQUAD-N, Emails-N, and DROP-N show that CLEAR significantly boosts fine-tuning performance by iteratively refining training data.
The paper "Automated Data Curation for Robust LLM Fine-Tuning" (2403.12776) introduces CLEAR (Confidence-based LLM Evaluation And Rectification), an automated pipeline designed to improve the quality of instruction tuning datasets for LLMs. The core idea is a data-centric approach: instead of solely focusing on refining the fine-tuning algorithm, CLEAR systematically improves the dataset used for training. This is particularly relevant in real-world scenarios where instruction tuning data is often noisy, containing inaccurate responses, poor formatting, or irrelevant examples, which can significantly degrade the performance of fine-tuned models.
CLEAR operates in two main stages: Auto-Filter and Auto-Correct. Both stages rely on confidence estimates derived from LLMs to make informed decisions about data quality. The key is to perform these modifications conservatively, ensuring that only confidently low-quality data is removed or confidently better alternatives are used for correction.
The CLEAR Pipeline
The pipeline begins with an original instruction tuning dataset consisting of (prompt, target response) pairs (xi,yi).
- Auto-Filter: The first step is to identify and remove low-quality data confidently. This is done before the main fine-tuning process.
- Auto-Correct: After an initial fine-tuning phase (preferably on the Auto-Filtered data), the resulting model is used to generate candidate responses for some or all prompts. These candidates are then evaluated against the original target responses, and confidently better candidates replace the original targets in the dataset.
- Iterative Improvement: The fine-tuned LLM can be retrained on the Auto-Corrected dataset. This process of fine-tuning and data correction can potentially be iterated to further refine the dataset and model.
This process is illustrated in Figure 1 of the paper, showing the flow from original data through filtering and correction steps, leading to an improved dataset for fine-tuning.
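Before the implementation sections below, the end-to-end flow of a single CLEAR pass can be sketched as one function. The callables `estimate_confidence`, `finetune`, and `auto_correct` are placeholders for the components detailed later, not APIs from the paper:

```python
import statistics

def clear_pass(base_model, dataset, estimate_confidence, finetune, auto_correct):
    """One pass of CLEAR: confidence-based filtering, fine-tuning, then correction.

    The three callables are placeholders for the components implemented in the
    sections below; `dataset` is a list of (prompt, response) pairs.
    """
    # Auto-Filter: score every (prompt, response) pair and keep the confident ones
    scored = [(x, y, estimate_confidence(x, y)) for x, y in dataset]
    gamma = statistics.median(c for _, _, c in scored)
    filtered = [(x, y) for x, y, c in scored if c > gamma]

    # First fine-tuning pass on the filtered data
    model = finetune(base_model, filtered)

    # Auto-Correct: replace low-confidence targets with confidently better candidates
    corrected = auto_correct(model, scored, gamma)

    # Second fine-tuning pass on the corrected dataset
    return finetune(model, corrected)
```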
Confidence-Based Evaluation
A critical component of CLEAR is the method for estimating the quality of responses or comparing two responses. The paper highlights that directly prompting an LLM to score response quality (e.g., on a 1-5 scale, as shown in Table 5) can be unreliable. Instead, CLEAR leverages BSDetector (2308.16175), a technique that provides confidence estimates (between 0 and 1) about an LLM's output quality or preference decisions.
BSDetector works by considering two factors:
- Observed Consistency: The LLM generates multiple candidate responses for the same prompt (e.g., via temperature sampling). Confidence is higher if the target response is semantically similar to these diverse generations.
- Self-Reflection Certainty: The LLM is also prompted to directly evaluate the target response and report its confidence.
These factors are combined to produce a single confidence score. This approach is model-agnostic, working with any LLM (including black-box APIs like GPT-3.5/4), and doesn't require access to model parameters or specific training. The paper's experiments (Figure 2, Table 3) show that this confidence-based approach is more effective at identifying low-quality data than direct LLM scoring.
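For intuition, the combination of the two factors can be sketched as a weighted average. The weight `beta` and the simple weighted-average form below are illustrative assumptions based on the BSDetector paper's description, not values prescribed by this paper:

```python
def combine_confidence(observed_consistency: float,
                       self_reflection_certainty: float,
                       beta: float = 0.7) -> float:
    """Blend BSDetector's two factors into a single confidence score in [0, 1].

    beta trades off agreement across sampled responses against the model's
    self-reported certainty; 0.7 is an illustrative default, not from the paper.
    """
    return beta * observed_consistency + (1.0 - beta) * self_reflection_certainty

# Example: high agreement across samples, moderate self-reported certainty
print(combine_confidence(0.9, 0.6))  # roughly 0.81
```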
Implementing Auto-Filter
The Auto-Filter stage aims to create a cleaner subset of the original dataset for initial fine-tuning.
Implementation Steps:
- Confidence Estimation: For every pair (xi,yi) in the original dataset, use the base pre-trained LLM and the BSDetector method to compute a confidence score ci that yi is a high-quality response for xi.
```python
from bsdetector import BSDetector  # Assuming a library implementation

base_LLM = load_base_LLM()  # Load your base LLM or configure API access
bsdetector = BSDetector(base_LLM)

dataset = load_instruction_tuning_data()  # List of (prompt, response) tuples

confidence_scores = []
for prompt, response in dataset:
    confidence = bsdetector.estimate_quality_confidence(prompt, response)
    confidence_scores.append(confidence)

# Store scores alongside the data: [(prompt_i, response_i, confidence_i), ...]
annotated_dataset = [
    (dataset[i][0], dataset[i][1], confidence_scores[i])
    for i in range(len(dataset))
]
```
- Set Threshold: Determine a confidence threshold γ. The paper uses the median confidence score of the dataset as a simple heuristic. Alternatively, γ could be tuned on a small validation set or set based on manual inspection of examples around different confidence levels.
```python
import numpy as np

# Example: Using the median confidence as threshold
gamma = np.median(confidence_scores)
```
- Filter Data: Create the filtered dataset F by keeping only the pairs where ci>γ.
```python
filtered_dataset = [(p, r) for p, r, c in annotated_dataset if c > gamma]
```
- Fine-tune: Fine-tune the LLM on the `filtered_dataset`. This is the first fine-tuning pass.

```python
finetune_LLM(base_LLM, filtered_dataset)  # Use your fine-tuning script/API
```
Practical Considerations for Auto-Filter:
- Computational Cost: Running BSDetector involves multiple LLM calls per data point, which can be expensive, especially for large datasets and expensive models (like GPT-4). Optimizing BSDetector calls or using a cheaper base model for this stage might be necessary.
- Threshold γ: Setting γ is a trade-off. A high γ removes more potentially noisy data but also reduces the total training set size. A low γ retains more data but includes more noise. The median heuristic is simple but may not be optimal for all datasets; the short sketch after this list illustrates some alternative choices.
- Base LLM Choice: Using the same base LLM for BSDetector as is being fine-tuned ensures the confidence estimates are relevant to the model's capabilities, which is a key aspect of the paper's methodology.
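As one illustration of the threshold trade-off, percentile-based thresholds (a hypothetical alternative, not taken from the paper) can be compared against the median heuristic:

```python
import numpy as np

# Hypothetical confidence scores produced by the Auto-Filter stage
confidence_scores = [0.91, 0.42, 0.77, 0.65, 0.58, 0.83]

gamma_median = np.median(confidence_scores)            # the paper's simple heuristic
gamma_lenient = np.percentile(confidence_scores, 25)   # keeps more data, admits more noise
gamma_strict = np.percentile(confidence_scores, 75)    # keeps less data, but cleaner
print(gamma_lenient, gamma_median, gamma_strict)
```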
Implementing Auto-Correct
The Auto-Correct stage aims to improve low-confidence examples (those filtered out or otherwise flagged as potentially problematic) by replacing their responses with confidently better alternatives.
Implementation Steps:
- Generate Candidate Responses: Use the LLM fine-tuned on the Auto-Filtered data (or the original data if skipping Auto-Filter) to generate a candidate response yi′ for prompts xi. This is done for examples where the original response yi was flagged as low confidence (e.g., ci≤γ). The paper shows benefits in using the fine-tuned model for this step compared to the base model (Table 4).
```python
finetuned_LLM = load_finetuned_LLM()  # Load the model trained on filtered_dataset

corrected_dataset = []
# Iterate through the original dataset (or just the filtered-out portion)
for prompt, original_response, confidence in annotated_dataset:
    if confidence <= gamma:  # Or some other criterion for potential correction
        candidate_response = finetuned_LLM.generate(prompt)
        # Keep track of original and candidate for comparison
        corrected_dataset.append((prompt, original_response, candidate_response))
    else:
        # Keep high-confidence examples as they are (no correction needed)
        corrected_dataset.append((prompt, original_response, original_response))
```
- Evaluate Candidate vs. Original: For examples where a candidate response yi′ was generated, use an LLM-as-judge approach to determine if yi′ is better than yi. The paper uses the base LLM and the prompt from Table 1 for this evaluation. BSDetector is then used to estimate the confidence c^i in this judgment (i.e., confidence that the judge's verdict is correct, specifically confidence that yi′ is indeed better than yi).
```python
# corrected_dataset entries: (prompt, original_response, candidate_response);
# for high-confidence examples the candidate equals the original response
final_dataset_for_finetuning = []
eta = 0.8  # threshold on preference confidence (the paper uses eta = 0.8)

for prompt, original_response, candidate_response in corrected_dataset:
    if original_response == candidate_response:
        # No correction attempted or needed
        final_dataset_for_finetuning.append((prompt, original_response))
        continue

    # Use the base LLM as judge with the prompt from Table 1 (pseudo-code),
    # then use BSDetector to estimate confidence in the judge's verdict
    # ("[[B]]" meaning the candidate response is better). This may require
    # adapting BSDetector to estimate confidence in preference judgments,
    # as described in Chen and Mueller (2023) [2308.16175].
    judge_output = base_LLM.judge(prompt, original_response, candidate_response)
    preference_confidence = bsdetector.estimate_preference_confidence(
        prompt, original_response, candidate_response, judge_output
    )

    if judge_output == "[[B]]" and preference_confidence > eta:
        # Candidate is confidently better: use the corrected response
        final_dataset_for_finetuning.append((prompt, candidate_response))
    else:
        # Not confidently better: per Figure 3, the example is filtered out.
        # A variant could instead keep the original response, depending on its
        # original confidence score.
        pass
```

(Note: the exact BSDetector function `estimate_preference_confidence` is conceptual here, based on the paper's description of BSDetector estimating confidence for preference predictions.)

- Fine-tune (Again): Fine-tune the LLM on the resulting `final_dataset_for_finetuning`. This dataset contains high-confidence original examples and examples where the original response was replaced by a confidently better LLM-generated candidate.

```python
finetune_LLM(finetuned_LLM, final_dataset_for_finetuning)  # Retrain on the refined dataset
```
Practical Considerations for Auto-Correct:
- Computational Cost: Generating candidate responses requires LLM calls. The LLM-as-judge step and its BSDetector confidence estimation also add computational overhead.
- Threshold η: The threshold η controls how aggressively corrections are applied. A higher η means fewer corrections but higher confidence in the changes.
- Choice of LLM for Correction: Using the fine-tuned LLM to generate candidates is beneficial because it is specialized to the domain. Using the base LLM as a judge provides a more objective assessment, less influenced by the fine-tuned model's potential biases or errors.
- Iterative Refinement: The paper suggests the process can be iterated. Each iteration might use the newly fine-tuned model to generate candidates for the next round of correction. The number of iterations is a practical hyperparameter.
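A minimal sketch of this outer loop, assuming `finetune(model, data)` returns the fine-tuned model and `auto_correct(model, data)` returns the refined dataset (both are placeholders for the steps above, not APIs from the paper):

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, response)

def iterate_clear(model,
                  dataset: List[Example],
                  finetune: Callable,
                  auto_correct: Callable,
                  num_rounds: int = 2):
    """Alternate fine-tuning and Auto-Correct for a fixed number of rounds.

    num_rounds is a practical hyperparameter; the callables stand in for the
    fine-tuning and correction steps described above.
    """
    for _ in range(num_rounds):
        model = finetune(model, dataset)
        dataset = auto_correct(model, dataset)
    return model, dataset
```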
Real-World Applications and Implications
CLEAR's practical value lies in its ability to systematically improve data quality for instruction tuning without requiring manual data annotation or relying on stronger, potentially unavailable or expensive, teacher models (like GPT-4 for fine-tuning Llama2). This makes it applicable in scenarios where:
- Domain-Specific Fine-tuning: Datasets are collected for niche domains where generic powerful models might not perform well, and creating high-quality data manually is expensive.
- Noisy Public Datasets: Fine-tuning on publicly available datasets that are known to contain errors or inconsistencies (like datasets scraped from the web or user interactions).
- Improving Existing Models: Enhancing the performance of an already fine-tuned model by curating a better dataset for subsequent training rounds.
- Data Scarcity (relative): While filtering removes data, the Auto-Correct stage attempts to salvage potentially useful prompts by fixing responses, mitigating the impact of simple filtering alone.
The paper's results across SQUAD-N, Emails-N, and DROP-N datasets (Tables 2, 3, 4) demonstrate consistent improvements in both response accuracy and format adherence (Valid JSON %). This highlights that investing in data curation via methods like CLEAR can be more impactful than solely focusing on model or algorithm changes, aligning with the principles of data-centric AI.
Limitations
A key limitation mentioned in the paper is that CLEAR does not explicitly account for or mitigate biases present in the original dataset. If the training data contains harmful biases, the fine-tuned model might perpetuate or even amplify them, even with corrected responses, as the underlying patterns of bias might remain in the data distribution or be introduced by the LLM judge or generator. This is a crucial consideration for deploying models trained with CLEAR in sensitive applications.
In summary, the CLEAR pipeline offers a practical, automated framework for enhancing the quality of instruction tuning data for LLMs by leveraging LLM-derived confidence scores to filter and correct data. Its model-agnostic nature and ability to improve models without relying on stronger teacher models make it a valuable tool for practitioners dealing with real-world, noisy datasets for specialized LLM tasks. Implementing CLEAR involves integrating confidence estimation (like BSDetector) and LLM-as-judge components into a standard fine-tuning workflow, considering the associated computational costs and the tuning of confidence thresholds.
Related Papers
- Small Language Models Improve Giants by Rewriting Their Outputs (2023)
- Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately (2024)
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning (2024)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (2024)
- Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models (2024)