
Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

(2312.02253)
Published Dec 4, 2023 in cs.CV , cs.AI , and cs.LG

Abstract

Recent advances in generative deep learning have enabled the creation of high-quality synthetic images via text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the fine-tuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage LLMs and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x the size of the original ImageNet, showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.

Overview

  • The paper proposes a method for training visual recognition models on synthetic data that avoids fine-tuning the generative model.

  • It introduces Label Ambiguity Resolution and two diversification techniques to improve the variety and semantic fidelity of synthetic images.

  • Domain adaptation techniques, including auxiliary batch normalization, prevent overfitting to the synthetic domain.

  • Experiments show the method outperforms traditional fine-tuning approaches as synthetic data is scaled up.

  • The result is an efficient, scalable pipeline with improved recognition performance and stronger out-of-domain generalization.

Recent advances in computer vision have shown promising results in using synthetic data to enhance visual recognition models, especially when only a small number of real images are available for training. Synthetic data refers to artificially generated images, typically created to augment real datasets for model training. The conventional approach has been to fine-tune diffusion models on ImageNet and employ the fine-tuned models to generate synthetic images conditioned on ImageNet class labels. However, this strategy comes with challenges.

One key issue is the complexity added by the fine-tuning process, which requires considerable computational resources and must be repeated for each new dataset, significantly increasing the overall burden. Additionally, once synthetic images outnumber real ones in the training data, recognition models tend to perform poorly due to the domain gap: the marked difference in data distribution between synthetic and real images.

In light of this, researchers have proposed a new approach that avoids the fine-tuning of generative models. By utilizing pre-trained diffusion models and leveraging advanced techniques, they have developed a scalable framework for generating a vast number of synthetic images to train recognition models.

The framework introduces three key techniques:

  1. Label Ambiguity Resolution (LAR): LAR addresses class names with multiple meanings. For example, "crane" could refer to a bird or to construction machinery. To avoid semantic misalignment, where synthetic images depict the wrong concept, LAR uses LLMs (together with CLIP, per the abstract) to add disambiguating context to class names, ensuring that the generated images match the intended subject; see the prompt-construction sketch after this list.
  2. Diversification Methods: The framework introduces contextualized diversification (CD) and stylized diversification (SD) to increase the variety of synthetic images. CD combines LLM-derived contextual elements, such as different environments or camera angles, with the class name to form rich, varied prompts. SD instead applies distinct visual styles, such as rendering a photo as a painting or sketch, through style-adjusted prompts.
  3. Domain Adaptation Techniques: To mitigate overfitting caused by the domain shift between real and synthetic images, the recognition model uses separate auxiliary batch normalization layers that process only synthetic images. Training batches also keep a balanced number of real and synthetic images so the model does not overfit to the synthetic domain; a minimal sketch of this appears below as well.
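
Taken together, the first two techniques amount to LLM-driven prompt engineering. The following Python sketch illustrates one way LAR, CD, and SD could be wired up. The `query_llm` helper, its canned outputs, and the prompt templates are illustrative placeholders, not the paper's actual code; the CLIP-based candidate scoring mentioned in the abstract is omitted for brevity.

```python
import random

def query_llm(instruction: str) -> list[str]:
    # Stand-in for a real chat-completion API call; returns canned examples
    # so the sketch runs end to end. Replace with your LLM client of choice.
    if "meanings" in instruction:
        return ["crane (large wading bird)", "crane (construction machine)"]
    return [
        "standing in a misty wetland at dawn",
        "in flight over a river, telephoto shot",
        "foraging in a grassy field, low camera angle",
    ]

def resolve_label(class_name: str) -> str:
    # LAR: ask the LLM for the possible visual meanings of an ambiguous
    # class name. The paper additionally uses CLIP to pick the sense that
    # matches real class images; here we simply take the first candidate.
    candidates = query_llm(
        f"List the possible visual meanings of '{class_name}', one per line."
    )
    return candidates[0]

def contextualized_prompts(label: str, n: int = 3) -> list[str]:
    # CD: merge LLM-suggested contexts (environments, camera angles, ...)
    # with the resolved label to form varied text-to-image prompts.
    contexts = query_llm(
        f"List {n} short scene descriptions in which a {label} could appear."
    )
    return [f"a photo of a {label}, {ctx}" for ctx in contexts[:n]]

def stylized_prompt(label: str) -> str:
    # SD: append a randomly chosen visual style to diversify appearance.
    styles = ["oil painting", "pencil sketch", "watercolor", "3D render"]
    return f"a {random.choice(styles)} of a {label}"

label = resolve_label("crane")
print(contextualized_prompts(label))
print(stylized_prompt(label))
```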

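For the third technique, the PyTorch sketch below shows auxiliary batch normalization under simple assumptions: real and synthetic images share all convolutional weights but pass through separate BatchNorm layers, selected by an `is_synthetic` flag, and each training step pairs a real mini-batch with an equally sized synthetic one. The module names and routing convention are illustrative, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class AuxiliaryBN2d(nn.Module):
    """BatchNorm wrapper with separate statistics for real vs. synthetic images."""

    def __init__(self, num_features: int):
        super().__init__()
        self.bn_real = nn.BatchNorm2d(num_features)  # normalizes real images
        self.bn_syn = nn.BatchNorm2d(num_features)   # auxiliary BN for synthetic

    def forward(self, x: torch.Tensor, is_synthetic: bool) -> torch.Tensor:
        return self.bn_syn(x) if is_synthetic else self.bn_real(x)

class TinyNet(nn.Module):
    """Toy classifier showing how the routing flag threads through forward()."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn = AuxiliaryBN2d(16)
        self.head = nn.Linear(16, num_classes)

    def forward(self, x: torch.Tensor, is_synthetic: bool) -> torch.Tensor:
        h = torch.relu(self.bn(self.conv(x), is_synthetic))
        return self.head(h.mean(dim=(2, 3)))  # global average pool

# One balanced training step: equal numbers of real and synthetic images,
# each routed through its own normalization branch.
model = TinyNet()
criterion = nn.CrossEntropyLoss()
x_real, y_real = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_syn, y_syn = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = criterion(model(x_real, is_synthetic=False), y_real) \
     + criterion(model(x_syn, is_synthetic=True), y_syn)
loss.backward()
```

Keeping separate normalization statistics lets the model absorb large volumes of synthetic data without contaminating the real-image statistics used at test time.
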
Notably, the approach achieves significant improvements over previous fine-tuning methodologies while requiring no generative fine-tuning at all. Experiments showed consistent performance gains as the volume of synthetic data scaled up to six times the size of the original ImageNet dataset, in stark contrast to prior work, where recognition performance declined. Models trained with this pipeline also demonstrated robust out-of-domain generalization.

In summary, this new method presents a compelling alternative to the traditional fine-tuning pipeline for synthetic image generation. It promises an efficient and effective way to capitalize on synthetic data for large-scale visual recognition training, fostering better model performance and adaptability.
