DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models (2308.06160v2)

Published 11 Aug 2023 in cs.CV

Abstract: Current deep networks are very data-hungry and benefit from training on largescale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks, and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder only needs less than 1% (around 100 images) manually labeled images, enabling the generation of an infinitely large annotated dataset. Then these synthetic data can be used for training various perception models for downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly more robust on domain generalization than using the real data alone; and state-of-the-art results in zero-shot segmentation setting; and 3) flexibility for efficient application and novel task composition (e.g., image editing). The project website and code can be found at https://weijiawu.github.io/DatasetDM_page/ and https://github.com/showlab/DatasetDM, respectively

Citations (69)

View on Semantic Scholar

Summary

The paper presents DatasetDM, a framework that leverages stable diffusion and a perception decoder to generate synthetic datasets with complex perception annotations from minimal labeled data.
The method integrates hypercolumn extraction and prompt diversification via generative models, achieving a 13.3% mIoU improvement on VOC 2012 and a 12.1% AP increase on COCO 2017.
The approach drastically reduces labeling requirements to around 100 images, offering scalable and cost-efficient solutions for training robust computer vision models.

An Expert Review of "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models"

The paper "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models" offers a methodological advancement in generating synthetic datasets for training perception models in computer vision. Through the integration of diffusion models and a perception decoder, the authors present a framework called DatasetDM that substantially lowers the requirements of labeled data, while effectively producing synthetic data annotated with complex perception tasks, such as segmentation and depth estimation.

Methodology Overview

DatasetDM is anchored in leveraging a pre-trained diffusion model to extend beyond traditional text-to-image generation into text-to-data generation. This innovative approach excels by employing a small sample, consisting of less than 1% of existing labeled data, to train a perception decoder that synthesizes diverse and annotated datasets. The diffusion model used here, Stable Diffusion, already demonstrates high efficacy in creating detailed and diverse image outputs from text prompts.

Central to DatasetDM is the P-Decoder, a unified architecture that enables the translation from rich latent code representations into multi-task perception outputs, including semantic and instance segmentation, pose estimation, and depth. This is achieved through the segmentation into two primary operational stages: training and inference. The training stage involves extracting hypercolumn representations through diffusion inversion and text-image representation fusion, preparing the perception decoder to handle various tasks without directly relying on sizeable labeled datasets. The inference stage exploits generative LLMs like GPT-4 to diversify prompts, enhancing the generation of synthetic data with richer and more varied descriptors.

Numerical Results

The experimental evaluation demonstrates DatasetDM’s strong performance in generating synthetic datasets that significantly enhance perception tasks. On the VOC 2012 dataset, DatasetDM reports improvements of 13.3% on mean Intersection over Union (mIoU) for semantic segmentation. In a further test on COCO 2017, for instance segmentation, DatasetDM attains advancements of 12.1% in Average Precision (AP) across several experimental setups compared to using only limited real data. These improvements underscore DatasetDM's capacity to generalize well in tasks that were traditionally dependent on labor-intensive data annotation processes.

Bold Claims and Practical Implications

A notable claim of the paper is its assertion that training the perception decoder only requires around 100 manually labeled images to achieve state-of-the-art performance. This approach effectively broadens the accessibility of creating comprehensive datasets with the potential to consistently train robust perception models. Furthermore, DatasetDM's methodology addresses challenges in specialized domains like medical imaging, where sensitive data collection is fraught with barriers.

The implications of DatasetDM are manifold in practical terms. The ability to synthesize annotated data rapidly mitigates costs and logistical challenges associated with traditional dataset creation. The common issues of scalability and privacy in data collection are less acute, and DatasetDM offers a pathway toward broader generalization in models trained for diverse perceptual tasks.

Theoretical Implications and Future Directions

The paper’s methodological advancements lay foundations for extending generative models' applicability to broader perceptual tasks. DatasetDM’s use of pre-trained diffusion models showcases how learning rich representations from image-text pairs can be leveraged beyond their conventional application, venturing into other domains of perception.

Future directions could involve refining the model to incorporate more sophisticated prompt engineering or integrating even more advanced generative models. This paper hints at a new frontier for synthetic data generation methodologies, encouraging exploration in using generative frameworks as a stand-alone tool for perception model training rather than merely a supplementary resource.

In conclusion, this paper’s contributions to synthetic data generation resonate with ongoing needs for efficient dataset synthesis approaches in deep learning. DatasetDM, by stretching the boundaries of diffusion models into the perceptual domains, marks a valuable step forward in synthetic data utility for computer vision applications. Its transformative potential is further amplified by the pathway it charts for future developments on efficient, scalable, and diverse data generation solutions.