
Scaling Laws of Synthetic Images for Model Training ... for Now

(2312.04567)
Published Dec 7, 2023 in cs.CV

Abstract

Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper, we study the scaling laws of synthetic images generated by state-of-the-art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.

Figure: Comparison of CLIP models on 15 tasks, using synthetic, real, and combined images under zero-shot classification.

Overview

  • Synthetic data generation is a significant avenue for training data augmentation in machine learning.

  • The study shows synthetic data is less efficient than real images for training supervised models, but its validation loss still follows a power law as the training set is scaled.

  • Synthetic data shines in specific cases, such as limited real data scenarios and out-of-distribution generalization, especially in CLIP training.

  • The efficiency of synthetic data for training depends on the choice of text-to-image model, the classifier-free guidance scale, and the text prompts used.

  • Future work is needed to improve generative models, which could potentially allow synthetic data to match or surpass real data in model training.

The Impact of Synthetic Data on Machine Learning Models

Understanding Synthetic Data in Model Training

In the realm of machine learning, the availability and quality of training data are a cornerstone of building robust models. Synthetic data generation has come to the forefront as a means of augmenting the limited supply of curated datasets, and researchers have been exploring images created by text-to-image models as training data. A recent examination of this approach provides new insights into how well synthetic data works for training both supervised classifiers and CLIP (Contrastive Language–Image Pretraining) models.

Key Findings from Recent Studies

Effectiveness in Supervised Models

When it comes to image classifiers trained under supervised settings, synthetic data does scale, albeit less efficiently than real images. The power-law relationship between training-set size and validation loss still holds, although the gains flatten earlier once the synthetic dataset becomes very large. The inability of text-to-image models to render certain concepts appears to be a pivotal factor in this scaling inefficiency.
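To make the power-law framing concrete, below is a minimal sketch (not the paper's code) of fitting a saturating power law L(n) = a · n^(−b) + c to (training-set size, validation loss) pairs; all data points are illustrative placeholders, not measurements from the paper:

```python
# Fit a saturating power law to hypothetical scaling measurements.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # a: scale factor, b: scaling exponent, c: irreducible loss floor
    return a * np.power(n, -b) + c

sizes = np.array([1e5, 3e5, 1e6, 3e6, 1e7])        # images in training set
losses = np.array([2.90, 2.55, 2.25, 2.05, 1.92])  # hypothetical val losses

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.2, 1.5])
print(f"L(n) ~= {a:.2f} * n^(-{b:.3f}) + {c:.2f}")
```

Comparing the fitted exponent b for synthetic versus real training sets captures the kind of comparison the paper draws: a smaller exponent means the loss falls more slowly as data is added.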

Advantages in Special Scenarios

Despite its general limitations, synthetic data demonstrates particular advantages in specific scenarios:

  • When the supply of real images for a supervised problem is limited (e.g., fewer than 0.5 million images in ImageNet), synthetic data scales more effectively.
  • Synthetic data can outperform real data on out-of-distribution tests, suggesting it helps models generalize beyond the original training distribution.
  • In CLIP training, combining synthetic and real data can significantly boost model performance, particularly when available training data is scarce (see the sketch below).
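As a rough illustration of that last point, here is a sketch of pooling real and synthetic image-text pairs into one CLIP training set, so every batch mixes both sources. It assumes PyTorch, and the PairDataset class with its random tensors are placeholders standing in for decoded images and tokenized captions:

```python
# Combined-data setup for CLIP-style training: real and synthetic
# (image, caption) pairs are pooled so each batch mixes both sources.
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class PairDataset(Dataset):
    def __init__(self, n):
        self.images = torch.randn(n, 3, 64, 64)         # stand-in images (downscaled)
        self.tokens = torch.randint(0, 49408, (n, 77))  # stand-in CLIP token ids

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        return self.images[i], self.tokens[i]

real = PairDataset(500)       # curated real pairs
synthetic = PairDataset(500)  # text-to-image generations paired with their prompts

loader = DataLoader(ConcatDataset([real, synthetic]), batch_size=128, shuffle=True)
images, tokens = next(iter(loader))
print(images.shape, tokens.shape)  # [128, 3, 64, 64], [128, 77]
```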

Influence of Model Choices and Prompts

Furthermore, the study finds that the choice of text-to-image model, the classifier-free guidance scale, and the nature of the text prompts all significantly affect how well synthetic data scales. After tuning these variables, synthetic data exhibits a scaling trend similar to that of real data, especially in CLIP training, though it remains slightly less effective.
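To make one of those knobs concrete, the sketch below sweeps the classifier-free guidance scale while sampling images for the same prompt. It assumes the Hugging Face diffusers library and a Stable Diffusion checkpoint, which stand in for whichever text-to-image model is actually used; the prompt and file names are illustrative:

```python
# Generate the same prompt at several classifier-free guidance scales.
# Lower scales generally trade prompt fidelity for sample diversity,
# which is why the scale matters for training-data quality.
# Requires: pip install diffusers transformers torch, and a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a goldfinch"  # classname-style prompt (illustrative)
for scale in (1.5, 2.0, 4.0, 7.5):
    image = pipe(prompt, guidance_scale=scale).images[0]
    image.save(f"goldfinch_cfg_{scale}.png")
```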

Implications for the Future

The insights from this research imply that synthetic data has the potential to be particularly effective in conditions where there is a substantial domain shift or when real images are not abundant. This is an encouraging development for scenarios that demand extensive data diversification or where data curation is challenging. Looking ahead, the results stress the need to refine the existing generative models to overcome their current limitations, which could eventually enable synthetic data to rival or even outperform real data in a wide range of training situations.

This study enriches our understanding of the role synthetic data can play as we continue to push the boundaries of machine learning and seek new solutions to data limitations.
