Slight Corruption in Pre-training Data Makes Better Diffusion Models

(2405.20494)
Published May 30, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audio, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in the pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth for the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.

Class- and text-conditional DMs pre-trained with varying corruption levels show enhanced image quality and diversity.

Overview

  • The paper empirically and theoretically explores how slight corruption in pre-training data can improve the performance of diffusion models (DMs), particularly in generating higher-quality and more diverse content.

  • By introducing synthetic corruptions to datasets such as ImageNet-1K (IN-1K) and CC3M, the research shows that models trained with up to 7.5% corrupted data outperform those trained on clean data across various metrics.

  • The authors propose a novel method called Conditional Embedding Perturbation (CEP), which adds noise to conditional embeddings during training, resulting in significant performance improvements in both pre-training and downstream personalization tasks.

Impact of Data Corruption on Diffusion Models: Empirical and Theoretical Insights

The paper addresses an essential aspect of diffusion models (DMs): the impact of data corruption during the pre-training phase. DMs have demonstrated incredible potential in generating high-quality images, audio, and video content. These models, however, depend heavily on large-scale data harvested from the web, which often contains noisy, inaccurate, and corrupted data pairs.

Empirical Evaluations and Findings

The authors present a thorough empirical study of how slight corruption in pre-training data affects DMs. By intentionally introducing synthetic corruptions into the ImageNet-1K (IN-1K) and CC3M datasets, they pre-train and evaluate more than 50 conditional DMs. The results are counterintuitive and significant: slight corruption (up to 7.5%) can enhance the quality, diversity, and fidelity of the generated content compared to training exclusively on clean data. A sketch of one such corruption scheme follows.
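As a concrete illustration, one plausible way to synthesize condition corruption for a class-conditional dataset is to re-assign a small fraction of labels uniformly at random. This is a hedged sketch in the spirit of the paper's setup; the exact corruption protocols may differ, and the function name and parameters here are illustrative.

```python
import random

def corrupt_labels(labels, num_classes, rho=0.05, seed=0):
    """Re-assign a fraction rho of class labels uniformly at random.

    labels:      list of integer class labels (e.g., IN-1K classes 0..999)
    num_classes: total number of classes in the dataset
    rho:         corruption ratio, e.g. 0.05 for 5% corrupted pairs
    """
    rng = random.Random(seed)
    corrupted = list(labels)
    for i in range(len(corrupted)):
        if rng.random() < rho:
            # Draw a random class; it may coincide with the original label.
            corrupted[i] = rng.randrange(num_classes)
    return corrupted

# Example: corrupt roughly 5% of a toy label set over 1000 classes.
clean = [3, 17, 512, 998] * 250
noisy = corrupt_labels(clean, num_classes=1000, rho=0.05)
```

For text-conditional data such as CC3M, the analogous operation would swap or perturb captions between images rather than flipping class indices.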

Key Findings

  1. Enhanced Quality: Models pre-trained with slightly corrupted data achieve lower Fréchet Inception Distance (FID) and higher Inception Score (IS) and CLIP scores (a computation sketch for these metrics follows this list).
  2. Increased Diversity: Corrupted models show higher entropy, indicating a more diverse sample distribution. The Relative Mahalanobis Distance (RMD) score also highlights higher image complexity and diversity.
  3. Downstream Personalization: Models influenced by slight data corruption during pre-training perform better in downstream tasks, like ControlNet and T2I-Adapter personalization on the IN-100 dataset.
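For reference, the quality metrics in finding 1 can be computed with off-the-shelf tooling. The sketch below uses torchmetrics (it needs the `torchmetrics[image]` extra, which pulls in torch-fidelity); it is a generic illustration with dummy tensors, not the paper's evaluation pipeline.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Dummy uint8 batches standing in for real and generated images (N, 3, H, W).
# In practice, use thousands of samples per side for stable estimates.
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # lower FID = closer to real data
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()  # higher IS = sharper, more confident samples
inception.update(fake)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```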

Theoretical Analysis

In addition to empirical findings, the paper offers a theoretical framework based on Gaussian mixture models to further substantiate its claims. The authors present two crucial theorems:

  1. Generation Diversity: Theorem 1 demonstrates that slight corruption increases the entropy of the generated distribution of \(\mathbf{z}_T\) compared to clean conditions, resulting in greater diversity in the generated images.
  2. Generation Quality: Theorem 2 shows that slight corruption decreases the 2-Wasserstein distance between the generated and real data distributions, leading to higher-quality generated content (see the background identities below).
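Two standard identities clarify why these theorems track diversity and quality: the differential entropy of a Gaussian grows with its covariance, and the 2-Wasserstein distance between Gaussians has a closed form. These are textbook facts included as background, not restatements of the paper's theorems.

```latex
% Differential entropy of a d-dimensional Gaussian: a larger (effective)
% covariance implies higher entropy, i.e., more diverse samples.
H\bigl(\mathcal{N}(\mu, \Sigma)\bigr) = \tfrac{1}{2} \ln \det(2 \pi e \, \Sigma)

% Closed-form 2-Wasserstein distance between two Gaussians, the quantity
% Theorem 2 bounds for the generated vs. ground-truth distributions.
W_2^2\bigl(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)\bigr)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\Bigl(\Sigma_1 + \Sigma_2
    - 2\bigl(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\bigr)^{1/2}\Bigr)
```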

Methodology: Conditional Embedding Perturbation (CEP)

Inspired by the empirical and theoretical findings, the authors propose a novel method termed Conditional Embedding Perturbation (CEP). CEP adds random noise to the condition embeddings during training, mimicking the beneficial effect of slight corruption without modifying the data itself; a sketch follows.
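Below is a minimal PyTorch-style sketch of the idea. The names (`eps_model`, `cep_training_step`) and the noise scale `gamma` are illustrative assumptions, not the authors' code; the essential step is adding small Gaussian noise to the condition embedding at training time only.

```python
import torch
import torch.nn.functional as F

def perturb_condition(cond_emb: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    # CEP-style perturbation: add small isotropic Gaussian noise to the
    # class/text embedding before it conditions the denoiser.
    return cond_emb + gamma * torch.randn_like(cond_emb)

def cep_training_step(eps_model, x0, cond_emb, alphas_cumprod, gamma=0.1):
    """One epsilon-prediction diffusion training step with CEP.

    eps_model(x_t, t, c) -> predicted noise (any conditional denoiser)
    alphas_cumprod: (T,) cumulative alpha-bar schedule of the forward process
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # q(x_t | x_0)
    c = perturb_condition(cond_emb, gamma)  # perturb only during training
    return F.mse_loss(eps_model(x_t, t, c), noise)
```

At sampling time the unperturbed condition embedding would be used; the perturbation acts purely as a training-time regularizer on the condition pathway.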

Results with CEP

Training DMs with CEP yields substantial improvements in both pre-training performance and downstream personalization:

  • Pre-training: Results on IN-1K and CC3M indicate that CEP improves FID, IS, and Precision-Recall metrics compared to baseline models.
  • Personalization: When applied to personalization tasks (e.g., with ControlNet), CEP enhances model performance, producing more reliable and visually appealing images in downstream applications.

Practical and Theoretical Implications

This study has several significant implications:

  • Practical: Given the unavoidable presence of data corruption in large-scale datasets, incorporating CEP during pre-training can enhance DM performance without the need for perfect data.
  • Theoretical: The findings prompt a re-examination of the conventional wisdom that clean data always yield the best models. Slight corruption can act as an implicit regularizer, mitigating overfitting.

Future Directions

The research opens various future avenues, including:

  • Expansion to Other Modalities: Extending the findings to audio and video diffusion models.
  • Robustness in Real-world Data: Applications in domain-specific scenarios like autonomous driving and healthcare, where data corruption is prevalent but high-quality performance is critical.
  • Adapting Theoretical Models: Refining theoretical models to better capture the nuanced behavior of DMs under data corruption.

Conclusion

The paper challenges the conventional belief that data corruption necessarily degrades model performance. Instead, it shows that slight data corruption during pre-training can be beneficial, improving the generalization capability and diversity of diffusion models. The proposed Conditional Embedding Perturbation (CEP) technique offers a straightforward yet effective way to harness this phenomenon, leading to better-performing diffusion models across a broad array of applications. This foundational work may influence future research and practical implementations in diffusion modeling and beyond.
