Emergent Mind

A Survey on Data Augmentation in Large Model Era

(arXiv:2401.15422)
Published Jan 27, 2024 in cs.LG, cs.CL, and cs.CV

Abstract

Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence, garnering significant interest from both academic and industrial spheres. However, the training of these large models necessitates vast quantities of high-quality data, and with continuous updates to these models, the existing reservoir of high-quality data may soon be depleted. This challenge has catalyzed a surge in research focused on data augmentation methods. Leveraging large models, these data augmentation techniques have outperformed traditional approaches. This paper offers an exhaustive review of large model-driven data augmentation methods, adopting a comprehensive perspective. We begin by establishing a classification of relevant studies into three main categories: image augmentation, text augmentation, and paired data augmentation. Following this, we delve into various data post-processing techniques pertinent to large model-based data augmentation. Our discussion then expands to encompass the array of applications for these data augmentation methods within natural language processing, computer vision, and audio signal processing. We proceed to evaluate the successes and limitations of large model-based data augmentation across different scenarios. Concluding our review, we highlight prospective challenges and avenues for future exploration in the field of data augmentation. Our objective is to furnish researchers with critical insights, ultimately contributing to the advancement of more sophisticated large models. We consistently maintain the related open-source materials at: https://github.com/MLGroup-JLU/LLM-data-aug-survey.

Figure: Evolutionary tree of major model-based data augmentation methods.

Overview

  • The paper provides a comprehensive review of data augmentation techniques driven by large models, focusing on image, text, and paired data augmentation methods.

  • It discusses various innovative data augmentation strategies, including prompt-driven and subject-driven methods for images, label-based and generated content-based strategies for texts, and generative techniques for paired data augmentation.

  • The paper highlights challenges and future research directions in data augmentation, such as the need for a theoretical foundation, effective scaling of augmented data, and the development of multimodal data augmentation techniques.


The paper "A Survey on Data Augmentation in Large Model Era" by Yue Zhou et al. reviews data augmentation techniques driven by large models, examining their evolution, current methodologies, and applications across domains. It organizes recent advances into three primary categories: image augmentation, text augmentation, and paired data augmentation, with a detailed discussion of innovative methods in each area.

Data Augmentation Methods

Image Augmentation

Image augmentation using large models is approached through prompt-driven and subject-driven strategies. Prompt-driven methods are further divided into text, visual, and multimodal categories:

  • Text Prompt-driven: This involves generating images based on textual descriptions, with notable methods like CamDiff and DiffEdit showing effective synthesis capabilities. These models leverage pre-trained checkpoints and diffusion models to produce high-quality imagery aligned with the input prompts.
  • Visual Prompt-driven: Techniques such as ImageBrush utilize visual cues, like transformation images, to direct the synthesis process. This category emphasizes retaining structural integrity and high-level semantics from reference images.
  • Multimodal Prompt-driven: Combining textual descriptions and visual cues, multimodal methods like ControlNet manipulate comprehensive features to create detailed and contextually rich images.

Subject-driven approaches, like DreamBooth and Custom Diffusion, focus on generating diverse and personalized renditions of user-provided subjects, maintaining the subject's unique characteristics while introducing context variations.
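Before any diffusion model is invoked, prompt-driven augmentation typically expands a small set of class names into many diverse text prompts. A minimal sketch of that prompt-construction step is below; the template and attribute lists are illustrative assumptions (the diffusion model call itself is omitted), not details taken from any specific surveyed method:

```python
from itertools import product

def build_prompts(class_names, contexts, styles):
    """Expand (class, context, style) combinations into text prompts
    intended for a text-to-image diffusion model (model call omitted)."""
    template = "a photo of a {cls}, {ctx}, {style}"
    return [template.format(cls=c, ctx=x, style=s)
            for c, x, s in product(class_names, contexts, styles)]

prompts = build_prompts(
    class_names=["cat", "dog"],
    contexts=["in the snow", "at night"],
    styles=["photorealistic", "watercolor"],
)
print(len(prompts))  # 2 * 2 * 2 = 8 prompt variants
```

Each resulting prompt would then be passed to a text-to-image model, so the number of synthetic images scales multiplicatively with the attribute lists.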

Text Augmentation

Text augmentation techniques are classified into label-based and generated content-based strategies:

  • Label-based: Methods such as Augmented SBERT leverage large models to annotate data, enriching datasets with labeled examples to improve tasks like text classification and question-answering.
  • Generated Content-based: Advanced models like ChatGPT are employed to synthesize varied and contextually aligned textual data, enhancing dataset diversity and aiding complex tasks like dialogue summarization.

Paired Data Augmentation

Paired data augmentation leverages the generative capabilities of large models to create multimodal datasets. Techniques like MixGen and BigAug generate image-text pairs or audio-text pairs, ensuring semantic consistency and enriching datasets for tasks in vision-language representation learning.
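The core MixGen operation can be sketched in a few lines: linearly interpolate two images and concatenate their captions, so the new pair stays semantically consistent with both sources. In this illustrative sketch images are flat float lists rather than tensors:

```python
def mixgen(pair_a, pair_b, lam=0.5):
    """MixGen-style paired augmentation: interpolate the two images
    with weight `lam` and concatenate the two captions.
    Images are represented as flat float lists for simplicity."""
    img_a, txt_a = pair_a
    img_b, txt_b = pair_b
    mixed_img = [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]
    mixed_txt = txt_a + " " + txt_b
    return mixed_img, mixed_txt

img, txt = mixgen(([0.0, 1.0], "a dog runs"),
                  ([1.0, 0.0], "on the beach"), lam=0.5)
print(img, txt)  # [0.5, 0.5] a dog runs on the beach
```

The same pattern extends to audio-text pairs by interpolating waveforms or spectrograms instead of pixel values.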

Data Post-processing Techniques

To ensure the quality of augmented data, several post-processing techniques are employed:

  • Top-K Selection: Methods like those used by InPars refine datasets by retaining the top-K relevant instances based on pre-defined criteria, significantly improving the performance of downstream tasks.
  • Model-based Approaches: Models like Flan-UL2 employ round-trip consistency techniques to filter and validate the generated text data, ensuring reliability and relevance.
  • Score-based Approaches: Using metrics like Dice loss and NLP-based heuristics, score-based techniques filter out low-quality or misaligned data, retaining only high-quality samples.
  • Cluster-based Approaches: Methods such as that of Yu et al. (2023) on diffusion-generated data leverage clustering to categorize the generated samples and retain a diverse, representative subset.
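The first and third post-processing strategies reduce to scoring and pruning. The sketch below shows both top-K selection and threshold-based filtering over a candidate pool; the scoring function is a hypothetical placeholder (here it simply reads a stored relevance value), standing in for a learned quality or consistency score:

```python
def select_top_k(candidates, score_fn, k):
    """Top-K selection: score every generated instance and keep
    only the k highest-scoring ones (as in InPars-style pipelines)."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:k]

def filter_by_threshold(candidates, score_fn, threshold):
    """Score-based filtering: drop generated samples whose quality
    score falls below a fixed threshold."""
    return [c for c in candidates if score_fn(c) >= threshold]

# hypothetical quality score: here, just the stored relevance value
samples = [("q1", 0.9), ("q2", 0.2), ("q3", 0.7), ("q4", 0.4)]
score = lambda s: s[1]
top = select_top_k(samples, score, k=2)
kept = filter_by_threshold(samples, score, threshold=0.5)
print(top)   # [('q1', 0.9), ('q3', 0.7)]
print(kept)  # [('q1', 0.9), ('q3', 0.7)]
```

Top-K fixes the output size regardless of overall quality, while thresholding fixes a quality bar regardless of output size; pipelines often choose based on how much augmented data the downstream task can absorb.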

Applications

The implications of these data augmentation techniques span across various domains:

  • NLP: Enhanced performance is observed in text classification, question answering, machine translation, and natural language inference tasks. Methods like AugGPT and FewGen have significantly improved model accuracies and generalization capabilities.
  • Computer Vision (CV): Techniques like DA-Fusion and SeedSelect show remarkable improvements in image classification, semantic segmentation, and object detection. Augmented data has led to more robust models capable of handling complex visual tasks.
  • Audio Signal Processing (ASP): Data augmentation methods have improved the robustness of models in tasks such as automatic speech recognition, speech emotion recognition, and automated audio captioning.

Challenges and Future Directions

The paper highlights several challenges and proposes future research directions:

  1. Theoretical Understanding: There is a need for a solid theoretical foundation to guide the application of data augmentation methods, ensuring effective and suitable techniques for diverse datasets.
  2. Scale of Augmented Data: Identifying the optimal quantity of augmented data is crucial. Excessive augmentation could lead to diminished returns or performance degradation.
  3. Multimodal Data Augmentation: There is a notable gap in the development of simultaneous multimodal data augmentation techniques, which hold potential for enriched and comprehensive datasets.
  4. Language and Vision Foundation Models: Bridging the gap between LLMs and robust vision foundation models remains a significant challenge.
  5. Automatic Data Augmentation: Developing methods for automatic selection of augmentation strategies can enhance the generalizability and efficiency of data augmentation processes.
  6. Robust and Consistent Data Generation: Ensuring the robustness and consistency of augmented data is paramount to prevent unintended biases or inaccuracies.
  7. Trustworthiness: Addressing biases and toxicity in generated data is crucial, especially for sensitive applications.
  8. Instruction Follow-through: Evaluating and improving the adherence of large models to instructions is critical for generating high-quality augmented data.
  9. Evaluation Metrics: Establishing direct evaluation metrics for augmented data, independent of specific tasks, is essential for assessing data quality.
  10. Augmented Data in Model Training: Leveraging augmented data to train large models effectively, coupled with robust evaluation frameworks, can significantly advance the field.

Conclusion

Data augmentation, driven by the capabilities of large models, plays a pivotal role in advancing AI performance across various domains. As researchers overcome existing challenges and refine augmentation techniques, substantial improvements in model robustness, diversity, and generalization are expected. This survey underscores the importance and potential of large model-based data augmentation, setting the stage for future innovations in the field.
