Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation

Published 25 Sep 2023 in cs.CV | (2309.14303v4)

Abstract: Preparing training data for deep vision models is a labor-intensive task. To address this, generative models have emerged as an effective solution for generating synthetic data. While current generative models produce image-level category labels, we propose a novel method for generating pixel-level semantic segmentation labels using the text-to-image generative model Stable Diffusion (SD). By utilizing the text prompts, cross-attention, and self-attention of SD, we introduce three new techniques: class-prompt appending, class-prompt cross-attention, and self-attention exponentiation. These techniques enable us to generate segmentation maps corresponding to synthetic images. These maps serve as pseudo-labels for training semantic segmenters, eliminating the need for labor-intensive pixel-wise annotation. To account for the imperfections in our pseudo-labels, we incorporate uncertainty regions into the segmentation, allowing us to disregard loss from those regions. We conduct evaluations on two datasets, PASCAL VOC and MSCOCO, and our approach significantly outperforms concurrent work. Our benchmarks and code will be released at https://github.com/VinAIResearch/Dataset-Diffusion

Summary

The paper introduces a diffusion-based method to synthesize pixel-level semantic segmentation datasets, cutting the need for manual annotations.
It combines Stable Diffusion's self- and cross-attention maps with class-prompt appending to derive pseudo-labels for the synthetic images.
Experiments on PASCAL VOC and MSCOCO demonstrate improved mIoU scores, validating the approach’s effectiveness over existing methods.

Overview of "Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation"

The paper by Nguyen et al. introduces Dataset Diffusion, a method that uses a diffusion model to generate synthetic datasets for pixel-level semantic segmentation. It targets a central bottleneck in semantic segmentation: training deep models typically requires large datasets with pixel-wise labels, which are expensive to annotate by hand. Using Stable Diffusion, a text-to-image generative model, the authors produce synthetic images together with pixel-wise semantic labels, reducing the dependence on manually annotated real-world data.

Methodology

Dataset Diffusion involves a systematic three-stage process:

Text Prompt Preparation: The authors write text prompts that explicitly state the target object classes. LLMs such as ChatGPT are used to generate diverse prompts, and the class names are then added to each prompt so that every target category is represented, which addresses problems such as labels missing from captions.
Segmentation Map Generation: The core technical contribution is the combined use of the self-attention and cross-attention maps that Stable Diffusion computes during generation. The authors introduce class-prompt appending, class-prompt cross-attention, and self-attention exponentiation. These techniques enhance the cross-attention maps, producing more precise segmentation maps that serve as pseudo-labels for training segmentation models.
Training Segmenters with Pseudo-Labels: The synthetic images and their segmentation maps are used to train semantic segmentation models such as DeepLabV3, with strategies to handle pseudo-label uncertainty. These include ignoring the loss from uncertain regions and using self-training to refine the segmentations.
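
The stages above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the prompt format (caption, semicolon, class names), the function names, the exponent tau, the two thresholds, and the use of 255 as the label for uncertain pixels are all assumptions made for illustration. The sketch assumes a self-attention matrix of shape (pixels, pixels) and a cross-attention matrix of shape (pixels, classes) have already been extracted from the diffusion model.

```python
import numpy as np

BACKGROUND = 0
UNCERTAIN = 255  # assumed ignore label (a common convention, not taken from the paper)


def append_class_prompt(caption, class_names):
    """Append the target class names to a caption (separator format is assumed)."""
    return f"{caption}; {' '.join(class_names)}"


def refine_cross_attention(self_attn, cross_attn, tau=4):
    """Propagate class scores through the self-attention matrix raised to tau.

    self_attn:  (P, P) pixel-to-pixel attention.
    cross_attn: (P, C) pixel-to-class-token attention.
    Returns (P, C) scores, min-max normalised per class to [0, 1].
    """
    propagated = np.linalg.matrix_power(self_attn, tau) @ cross_attn
    lo = propagated.min(axis=0, keepdims=True)
    hi = propagated.max(axis=0, keepdims=True)
    return (propagated - lo) / np.maximum(hi - lo, 1e-8)


def to_pseudo_labels(scores, class_ids, low=0.3, high=0.6):
    """Turn per-class scores into labels with an uncertain band.

    Pixels whose best score is below `low` become background, pixels between
    `low` and `high` are marked UNCERTAIN, and the rest take the argmax class.
    Both thresholds are hypothetical values.
    """
    best = scores.max(axis=1)
    labels = np.asarray(class_ids)[scores.argmax(axis=1)]
    labels = np.where(best < low, BACKGROUND, labels)
    labels = np.where((best >= low) & (best < high), UNCERTAIN, labels)
    return labels
```

With labels of this form, the uncertain pixels can be excluded from training by telling the loss to skip them; in PyTorch, for example, torch.nn.CrossEntropyLoss accepts an ignore_index argument for this purpose.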

Experimental Results

The proposed approach is validated on two established datasets, PASCAL VOC and MSCOCO, showing significant improvements over existing methods. The method reaches an mIoU of 64.8 on the synth-VOC benchmark and 34.2 on synth-COCO, outperforming concurrent work such as DiffuMask. Segmentation accuracy holds up across object categories, which suggests the synthetic data captures enough real-world variation to train a usable segmenter.
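
For readers unfamiliar with the metric, mIoU (mean intersection over union) averages, over classes, the overlap between predicted and ground-truth pixels divided by their union. The following is a minimal sketch of that computation, assuming integer label arrays and 255 as the ignore label; the function name and arguments are hypothetical, and benchmark implementations normally accumulate the confusion matrix over the whole evaluation set rather than a single image.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Drop pixels whose ground truth is marked as ignored.
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection

    # Average only over classes that occur at least once.
    present = union > 0
    return float((intersection[present] / union[present]).mean())
```

The function returns a value between 0 and 1; figures such as 64.8 are the same quantity expressed as a percentage.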

Implications and Future Directions

The implications of this work are multifaceted. Practically, it presents a viable alternative to labor-intensive manual dataset labeling, making it possible to generate sufficiently annotated data using text prompts, thus democratizing access to high-quality training datasets. Theoretically, it opens new avenues in leveraging generative models for dataset creation, encouraging further research in improving the fidelity and variety of generated data.

Future developments could explore several enhancements, such as improving the textual embedding to generate more complex and diverse scenes. Also, addressing biases in the diffusion model’s training data could enhance the reliability and representativeness of generated datasets. Furthermore, refining the integration of self- and cross-attention for better segmentation accuracy could provide even closer alignment with real-world data distribution characteristics.

In conclusion, Dataset Diffusion stands as a promising method for synthetic dataset generation, capable of supplementing or even replacing traditional data labeling processes in certain contexts. It represents an incremental yet significant step towards more efficient and scalable semantic segmentation training methodologies.