Abstract

Within the evolving landscape of deep learning, the tension between data quantity and quality has been a long-standing problem. The recent advent of LLMs offers a data-centric remedy: alleviating the limitations of real-world data through synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay at the surface level. This paper therefore organizes relevant studies around a generic workflow of synthetic data generation, highlighting the gaps in existing research and outlining prospective avenues for future study. The work aims to guide the academic and industrial communities toward deeper, more methodical inquiry into the capabilities and applications of LLM-driven synthetic data generation.

Taxonomy of LLM-driven synthetic data generation, curation, and evaluation.

Overview

  • The paper presents an extensive survey of LLMs in synthetic data generation, highlighting their potential to address issues of data scarcity and quality in deep learning.

  • It proposes a workflow encompassing data generation, curation, and evaluation, emphasizing key techniques such as prompt engineering, multi-step generation, and high-quality sample filtering.

  • Future research directions include enhancing LLMs' reasoning abilities, leveraging external knowledge bases, and exploring human-model collaboration for more reliable data annotation and verification.

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

The paper "On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey," authored by Lin Long et al., offers a comprehensive overview of the roles and potential of LLMs in synthetic data generation. The study delineates the current landscape, underpinned by a well-organized workflow encompassing data generation, curation, and evaluation. The paper aims to address existing gaps, provide a framework for future studies, and help both academic and industrial communities use LLMs more effectively for synthetic data-driven tasks.

Introduction

The critical issue of data quantity and quality in deep learning (DL) has persisted despite numerous advancements. LLMs present a promising solution by generating synthetic data to mitigate the limitations tied to real-world data scarcity, high acquisition costs, and privacy issues. However, the current research landscape on this topic lacks a unified framework. This paper consolidates various studies into a comprehensive workflow, highlighting existing gaps and proposing directions for future research.

Preliminaries

The primary challenge addressed is generating high-quality synthetic data with pre-trained LLMs. This typically involves data augmentation from a small set of seed samples, formalized in the paper's Equation 1. Two requisites for the generated data, faithfulness and diversity, are the focal points of the research: faithfulness refers to logical and grammatical coherence, while diversity ensures variability in text attributes such as length and style.
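As an illustrative sketch of this seed-driven setup (the notation here is assumed for exposition; it is not the survey's exact Equation 1), one can write:

```latex
% Illustrative formulation (assumed notation, not the survey's Equation 1):
% an LLM \mathcal{M}, conditioned on a prompt built by template T from a
% task instruction e and seed set D_seed, is sampled to yield D_gen.
\[
\mathcal{D}_{\mathrm{gen}}
  = \bigl\{\, \tilde{x} \;\big|\;
      \tilde{x} \sim p_{\mathcal{M}}\!\bigl(\cdot \mid T(e, \mathcal{D}_{\mathrm{seed}})\bigr)
    \,\bigr\}
\]
```

Faithfulness and diversity are then properties of the sampled set: each individual $\tilde{x}$ should be coherent, while the set as a whole should vary in attributes such as length and style.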

Generic Workflow

Data Generation

The study underscores prompt engineering and multi-step generation as pivotal for effective synthetic data generation. Prompt engineering involves the design of sophisticated prompts encompassing task specification, conditional prompting, and in-context learning (ICL). An effective prompt enhances the faithfulness and diversity of the synthetic data by guiding the LLMs with explicit instructions and integrating heuristic conditions and exemplars.
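The three prompt components named above can be sketched as follows; the function name and prompt layout are illustrative choices, not the survey's own interface:

```python
def build_prompt(task_spec, conditions=None, exemplars=None):
    """Assemble a generation prompt from a task specification,
    heuristic conditions (attribute-value pairs), and ICL exemplars."""
    parts = [task_spec]
    if conditions:
        parts.append("Constraints:\n" + "\n".join(
            f"- {attr}: {value}" for attr, value in conditions.items()))
    if exemplars:
        parts.append("Examples:\n" + "\n".join(f"- {ex}" for ex in exemplars))
    parts.append("New sample:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Generate one movie review for sentiment classification.",
    conditions={"sentiment": "positive", "length": "one sentence"},
    exemplars=["A heartfelt story with stellar acting."],
)
```

The task specification supplies the explicit instruction, the conditions steer attributes of the output, and the exemplars anchor the format via in-context learning.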

Multi-step generation, on the other hand, entails decomposing complex generation tasks into simpler sub-tasks, which improves the overall quality of the generated samples by allowing the LLMs to focus on smaller, more manageable components.
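A hedged sketch of such a decomposition, assuming a caller-supplied `llm` callable (prompt in, text out) and an invented three-step topic/passage/label pipeline:

```python
def multi_step_generate(llm, domain):
    """Decompose one generation task into three simpler sub-tasks:
    pick a topic, write a passage about it, then assign a label."""
    topic = llm(f"Propose one specific topic in the domain: {domain}.")
    passage = llm(f"Write a short passage about: {topic}")
    label = llm(f"Give a one-word sentiment label for: {passage}")
    return {"topic": topic, "passage": passage, "label": label}
```

Each call confronts the model with a single narrow sub-task, which is the intuition behind the quality gains from multi-step generation.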

Data Curation

Post-generation, a significant portion of the synthetic data is likely to be noisy or irrelevant. The paper details two main approaches for data curation: high-quality sample filtering and label enhancement. Filtering involves heuristic metrics and sample re-weighting to identify and select the most valuable data. Label enhancement leverages human intervention or auxiliary models for relabeling to rectify erroneous annotations.
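The filtering-plus-re-weighting step might look like the following minimal sketch, with an invented `score_fn` standing in for whatever heuristic quality metric a given pipeline uses:

```python
def curate(samples, score_fn, threshold=0.5):
    """Drop samples whose heuristic quality score falls below the
    threshold, then re-weight survivors in proportion to their score."""
    scored = [(sample, score_fn(sample)) for sample in samples]
    kept = [(sample, score) for sample, score in scored if score >= threshold]
    total = sum(score for _, score in kept) or 1.0
    return [(sample, score / total) for sample, score in kept]
```

In practice the score might come from a classifier's confidence, a perplexity estimate, or agreement between multiple generations; the normalization yields per-sample training weights.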

Data Evaluation

The quality of the generated data can be evaluated both directly and indirectly. Direct evaluation assesses the data's faithfulness and diversity using metrics such as cosine similarity and vocabulary statistics, while indirect evaluation measures the impact of the synthetic data on the performance of downstream models.
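The two direct metrics named above can be sketched with simple bag-of-words stand-ins (a real pipeline would typically compute cosine similarity over learned embeddings rather than raw token counts):

```python
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors: a crude
    faithfulness proxy (embeddings would be used in practice)."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def type_token_ratio(texts):
    """Vocabulary statistic: distinct tokens over total tokens,
    a simple proxy for lexical diversity across a corpus."""
    tokens = [word for text in texts for word in text.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

High pairwise similarity among generated samples signals low diversity, while a low type-token ratio flags repetitive vocabulary; indirect evaluation complements both by training a downstream model on the synthetic set.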

Future Directions

The paper outlines several future research areas:

  • Complex Task Decomposition: Enhancing the reasoning and planning capabilities of LLMs for complex data generation tasks.
  • Knowledge Enhancement: Leveraging external knowledge bases to improve the accuracy and diversity of the generated data.
  • Synergy between Large and Small LMs: Exploring collaborative strategies between LLMs and smaller, domain-specific models for more efficient data generation.
  • Human-Model Collaboration: Designing interactive systems to incorporate essential human expertise for data annotation and verification, thereby ensuring the generated data's reliability.

Conclusion

This survey provides a detailed exploration of LLM-driven synthetic data generation, curation, and evaluation. It aims to facilitate the development of domain-specific datasets and to highlight the challenges and opportunities within this field. The insights provided are valuable for advancing data-centric AI and for promoting the efficient production of high-quality data across domains.

Implications

The implications of this research are far-reaching. Practically, it could enable the scalable generation of diverse, high-quality datasets, substantially reducing the dependence on human-labeled data. Theoretically, it paves the way for sophisticated models trained on rich synthetic data, potentially enhancing model generalization and performance in real-world applications. Future developments in AI will likely build on these foundations, integrating more advanced knowledge-enhancement techniques and human-in-the-loop systems to refine and expand the capabilities of LLMs in synthetic data generation.
