Abstract

Despite recent advances in LLMs, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using LLMs, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.

Overview

  • DataTune is a novel methodology designed to enhance dataset generation for machine learning models by transforming existing datasets to match the specific requirements of a target task, thus increasing dataset diversity and difficulty.

  • The approach includes steps such as dataset retrieval using DataFinder and LLM re-ranking, schema selection for data relevance, and detailed planning and execution of dataset transformations.

  • In comparative studies using the BIG-Bench benchmark for language-based tasks, DataTune-enhanced models demonstrated superior performance compared to those trained on synthetic or non-transformed datasets.

  • DataTune supports better generalization in models by increasing the complexity of datasets, although it faces limitations such as a heavy dependence on LLM queries and difficulty handling non-English data.

Enhancing LLMs with DataTune for Dataset Transformation and Repurposing

Introduction to DataTune

DataTune represents a novel approach to improving dataset generation for machine learning models, especially for tasks where task-specific data is scarce or inaccessible. By leveraging existing, publicly available datasets and transforming them into formats aligned with the requirements of the target task, DataTune enhances both the diversity and difficulty of training datasets. This approach has been demonstrated to significantly outperform traditional few-shot prompting and existing dataset generation methods when the resulting data is used for supervised fine-tuning.

Problem Settings and Methodology

DataTune operates under the challenge of repurposing existing datasets to create fine-tuning data aligned with new task specifications. This involves a multi-step process:

  1. Dataset Retrieval: Leveraging a dual-stage approach including DataFinder and advanced LLM reranking to identify the most relevant datasets from extensive repositories like the HuggingFace Hub.
  2. Dataset Transformation: The core of DataTune, this step involves planning and executing modifications on selected datasets to align them with the target task. This includes schema selection to filter irrelevant data, planning to detail the sequence of transformations, and finally executing these transformations on the dataset. A minimal code sketch of the full pipeline follows this list.
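
To make the pipeline concrete, the sketch below shows one way the two stages could be wired together in Python. The `query_llm` helper, the prompts, and the keyword search over the HuggingFace Hub are illustrative stand-ins, not the actual prompt2model/DataTune implementation, which uses a trained DataFinder retriever and its own prompt templates.

```python
# A minimal sketch of a DataTune-style pipeline: retrieve candidate datasets,
# re-rank them with an LLM, and transform source rows into target-task examples.
# `query_llm` is a hypothetical helper standing in for whatever instruction-following
# model is used; the real implementation differs in retriever and prompts.

import json
from huggingface_hub import list_datasets
from datasets import load_dataset

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an instruction-following LLM."""
    raise NotImplementedError

def retrieve_candidates(task_description: str, k: int = 10) -> list[str]:
    # Stage 1: coarse retrieval. Here we simply keyword-search the HuggingFace Hub;
    # DataFinder instead uses a trained retriever over dataset descriptions.
    return [d.id for d in list_datasets(search=task_description, limit=k)]

def rerank(task_description: str, candidates: list[str]) -> str:
    # Stage 2: ask the LLM which candidate dataset best matches the task.
    prompt = (
        f"Task: {task_description}\n"
        f"Candidate datasets: {candidates}\n"
        "Answer with the single most relevant dataset id."
    )
    return query_llm(prompt).strip()

def transform_example(task_description: str, plan: str, row: dict) -> dict:
    # Core DataTune step: rewrite one source row into an (input, output) pair
    # for the target task, following a previously generated transformation plan.
    prompt = (
        f"Task: {task_description}\nTransformation plan: {plan}\n"
        f"Source row: {row}\n"
        "Return JSON with keys 'input' and 'output' for the target task."
    )
    return json.loads(query_llm(prompt))

def datatune(task_description: str, n_examples: int = 100) -> list[dict]:
    dataset_id = rerank(task_description, retrieve_candidates(task_description))
    source = load_dataset(dataset_id, split="train")
    # Schema selection and plan generation are separate LLM calls in the paper;
    # they are collapsed into a single planning prompt here for brevity.
    plan = query_llm(
        f"Write a plan for converting rows of {dataset_id} into examples "
        f"for this task: {task_description}"
    )
    return [transform_example(task_description, plan, row)
            for row in source.select(range(min(n_examples, len(source))))]
```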

Experimental Setup

In rigorous evaluations using the BIG-Bench benchmark across six diverse language-based tasks, DataTune has shown promising improvements. The method uses retrieval and transformation processes to generate datasets on which an LLM is then fine-tuned. The performance of the DataTune-enhanced models surpasses that of models trained on purely synthetic data and of those trained with existing methods like Prompt2Model.

  1. Fine-tuning is performed on a baseline model (Mistral-7B) using datasets generated through various methods, including DataTune and synthetic generation (a minimal fine-tuning sketch follows this list).
  2. Comparative analysis includes both individual method assessments and their combinations, providing insights into the additive benefits of integrating DataTune with synthetic data generation.
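
The following is a minimal supervised fine-tuning sketch using the HuggingFace transformers Trainer, assuming the generated examples are stored as a JSONL file of {"input": ..., "output": ...} records. The file name, prompt format, and hyperparameters are assumptions for illustration, not the paper's exact training setup; in practice a 7B model would typically also need parameter-efficient methods such as LoRA or quantization.

```python
# Minimal supervised fine-tuning sketch on DataTune-style generated data.
# File name, prompt format, and hyperparameters are illustrative assumptions.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed format: one JSON object per line with "input" and "output" fields.
data = load_dataset("json", data_files="datatune_examples.jsonl", split="train")

def to_text(example):
    # Concatenate input and output into a single training sequence.
    return {"text": f"{example['input']}\n{example['output']}{tokenizer.eos_token}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = data.map(to_text).map(tokenize, remove_columns=data.column_names + ["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="datatune-mistral", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=tokenized,
    # Causal language modeling: labels are the (shifted) input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```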

Results and Analytical Insights

  • Performance Enhancement: Models fine-tuned on DataTune-generated data not only improve over baseline few-shot performances but also show superior results compared to using existing or purely synthetic datasets. For instance, DataTune improves performance by an average of 11 points over datasets retrieved without transformation and by 2.9 points over synthetic datasets.
  • Dataset Quality: DataTune successfully increases the diversity and complexity of the datasets. It decreases the duplication rate in generated datasets and presents more lexically diverse training examples compared to traditional synthetic generation methods. Simple ways to measure both properties are sketched after this list.
  • Task Complexity: The transformed datasets tend to include more challenging examples, fostering models that potentially generalize better across more complex real-world applications.
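
The snippet below sketches two generic diagnostics of the kind referenced above: an exact-duplicate rate and a distinct-n lexical diversity score. These are illustrative metrics, not necessarily the exact measures used in the paper.

```python
# Generic dataset-quality diagnostics: exact-duplicate rate and distinct-n diversity.

from collections import Counter

def duplication_rate(examples: list[str]) -> float:
    """Fraction of examples that are exact duplicates of an earlier example."""
    counts = Counter(examples)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(examples) if examples else 0.0

def distinct_n(examples: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all examples."""
    ngrams = []
    for text in examples:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

generated = ["classify the review as positive", "classify the review as positive",
             "label the tweet's sentiment"]
print(duplication_rate(generated))  # one of the three examples is a repeat
print(distinct_n(generated, n=2))   # higher values indicate more lexical diversity
```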

Limitations and Future Directions

Several limitations currently constrain the applicability and efficiency of DataTune:

  • LLM Dependency: The transformation stage requires many LLM queries per dataset, which can be costly.
  • Non-English Data Handling: The pipeline handles tasks involving non-English datasets poorly, often leading to improper data processing.
  • Model Dependence: The system depends heavily on the instruction-following capabilities of LLMs, limiting the choice of usable models.

Future enhancements could explore reducing reliance on costly LLM operations, broadening language support, and streamlining transformation processes. Further research might also examine the effectiveness of retrieving open-web data and adapting it into fine-tuning datasets.

Conclusion

DataTune establishes a robust framework for enhancing dataset generation through the innovative transformation of existing data resources. It offers significant improvements over existing methods and sets a promising direction for future research in dataset creation and model fine-tuning strategies. As LLMs continue to evolve, methods like DataTune will be crucial in maximizing their potential across a broader spectrum of tasks and languages.
