Slight Corruption in Pre-training Data Makes Better Diffusion Models

(2405.20494)
Published May 30, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audio, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in the pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth for the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.

Class- and text-conditional DMs pre-trained with varying corruption levels show enhanced image quality and diversity.

Overview

  • The paper empirically and theoretically explores how slight corruption in pre-training data can improve the performance of diffusion models (DMs), particularly in generating higher-quality and more diverse content.

  • By introducing synthetic corruptions to datasets such as ImageNet-1K (IN-1K) and CC3M, the research shows that models trained with up to 7.5% corrupted data outperform those trained on clean data across various metrics.

  • The authors propose a novel method called Conditional Embedding Perturbation (CEP), which adds noise to conditional embeddings during training, resulting in significant performance improvements in both pre-training and downstream personalization tasks.

Impact of Data Corruption on Diffusion Models: Empirical and Theoretical Insights

The paper addresses an essential aspect of diffusion models (DMs): the impact of data corruption during the pre-training phase. DMs have demonstrated incredible potential in generating high-quality images, audio, and video content. These models, however, depend heavily on large-scale data harvested from the web, which often contains noisy, inaccurate, and corrupted data pairs.

Empirical Evaluations and Findings

The authors present a thorough empirical study of how slight corruption in pre-training data affects DMs. By intentionally introducing synthetic corruptions into the ImageNet-1K (IN-1K) and CC3M datasets, they pre-train and evaluate more than 50 conditional DMs. The results are counterintuitive and significant: slight corruption (up to 7.5%) can enhance the quality, diversity, and fidelity of the generated content compared to training exclusively on clean data. A sketch of one such corruption scheme follows.
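As a concrete illustration, one plausible way to synthesize condition corruption for a class-conditional dataset is to re-assign a small fraction of labels uniformly at random. This is a hedged sketch in the spirit of the paper's setup; the exact corruption protocols may differ, and the function name and parameters here are illustrative.

```python
import random

def corrupt_labels(labels, num_classes, rho=0.05, seed=0):
    """Re-assign a fraction rho of class labels uniformly at random.

    labels:      list of integer class labels (e.g., IN-1K classes 0..999)
    num_classes: total number of classes in the dataset
    rho:         corruption ratio, e.g. 0.05 for 5% corrupted pairs
    """
    rng = random.Random(seed)
    corrupted = list(labels)
    for i in range(len(corrupted)):
        if rng.random() < rho:
            # Draw a random class; it may coincide with the original label.
            corrupted[i] = rng.randrange(num_classes)
    return corrupted

# Example: corrupt roughly 5% of a toy label set over 1000 classes.
clean = [3, 17, 512, 998] * 250
noisy = corrupt_labels(clean, num_classes=1000, rho=0.05)
```

For text-conditional data such as CC3M, the analogous operation would swap or perturb captions between images rather than flipping class indices.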

Key Findings

  1. Enhanced Quality: Models pre-trained with slightly corrupted data achieve lower Fréchet Inception Distance (FID) and higher Inception Score (IS) and CLIP scores (a computation sketch for these metrics follows this list).
  2. Increased Diversity: Corrupted models show higher entropy, indicating a more diverse sample distribution. The Relative Mahalanobis Distance (RMD) score also highlights higher image complexity and diversity.
  3. Downstream Personalization: Models influenced by slight data corruption during pre-training perform better in downstream tasks, like ControlNet and T2I-Adapter personalization on the IN-100 dataset.
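For reference, the quality metrics in finding 1 can be computed with off-the-shelf tooling. The sketch below uses torchmetrics (it needs the `torchmetrics[image]` extra, which pulls in torch-fidelity); it is a generic illustration with dummy tensors, not the paper's evaluation pipeline.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Dummy uint8 batches standing in for real and generated images (N, 3, H, W).
# In practice, use thousands of samples per side for stable estimates.
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # lower FID = closer to real data
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()  # higher IS = sharper, more confident samples
inception.update(fake)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```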

Theoretical Analysis

In addition to empirical findings, the paper offers a theoretical framework based on Gaussian mixture models to further substantiate its claims. The authors present two crucial theorems:

  1. Generation Diversity: Theorem 1 demonstrates that slight corruption increases the entropy of the generated distribution of \(\mathbf{z}_T\) compared to clean conditions, resulting in greater diversity in the generated images.
  2. Generation Quality: Theorem 2 shows that slight corruption decreases the 2-Wasserstein distance between the generated and real data distributions, leading to higher-quality generated content (see the background identities below).
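Two standard identities clarify why these theorems track diversity and quality: the differential entropy of a Gaussian grows with its covariance, and the 2-Wasserstein distance between Gaussians has a closed form. These are textbook facts included as background, not restatements of the paper's theorems.

```latex
% Differential entropy of a d-dimensional Gaussian: a larger (effective)
% covariance implies higher entropy, i.e., more diverse samples.
H\bigl(\mathcal{N}(\mu, \Sigma)\bigr) = \tfrac{1}{2} \ln \det(2 \pi e \, \Sigma)

% Closed-form 2-Wasserstein distance between two Gaussians, the quantity
% Theorem 2 bounds for the generated vs. ground-truth distributions.
W_2^2\bigl(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)\bigr)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\Bigl(\Sigma_1 + \Sigma_2
    - 2\bigl(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\bigr)^{1/2}\Bigr)
```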

Methodology: Conditional Embedding Perturbation (CEP)

Inspired by the empirical and theoretical findings, the authors propose a novel method termed Conditional Embedding Perturbation (CEP). CEP adds random noise to the condition embeddings during training, mimicking the beneficial effect of slight corruption without modifying the data itself; a sketch follows.
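Below is a minimal PyTorch-style sketch of the idea. The names (`eps_model`, `cep_training_step`) and the noise scale `gamma` are illustrative assumptions, not the authors' code; the essential step is adding small Gaussian noise to the condition embedding at training time only.

```python
import torch
import torch.nn.functional as F

def perturb_condition(cond_emb: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    # CEP-style perturbation: add small isotropic Gaussian noise to the
    # class/text embedding before it conditions the denoiser.
    return cond_emb + gamma * torch.randn_like(cond_emb)

def cep_training_step(eps_model, x0, cond_emb, alphas_cumprod, gamma=0.1):
    """One epsilon-prediction diffusion training step with CEP.

    eps_model(x_t, t, c) -> predicted noise (any conditional denoiser)
    alphas_cumprod: (T,) cumulative alpha-bar schedule of the forward process
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # q(x_t | x_0)
    c = perturb_condition(cond_emb, gamma)  # perturb only during training
    return F.mse_loss(eps_model(x_t, t, c), noise)
```

At sampling time the unperturbed condition embedding would be used; the perturbation acts purely as a training-time regularizer on the condition pathway.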

Results with CEP

Training DMs with CEP yields substantial improvements in both pre-training performance and downstream personalization:

  • Pre-training: Results on IN-1K and CC3M indicate that CEP improves FID, IS, and Precision-Recall metrics compared to baseline models.
  • Personalization: When applied to personalization tasks (e.g., with ControlNet), CEP enhances model performance, producing more reliable and visually appealing images in downstream applications.

Practical and Theoretical Implications

This study has several significant implications:

  • Practical: Given the unavoidable presence of data corruption in large-scale datasets, incorporating CEP during pre-training can enhance DM performance without the need for perfect data.
  • Theoretical: The findings prompt a re-examination of the conventional wisdom that clean data always yield the best models. Slight corruption can act as an implicit regularizer, mitigating overfitting.

Future Directions

The research opens various future avenues, including:

  • Expansion to Other Modalities: Extending the findings to audio and video diffusion models.
  • Robustness in Real-world Data: Applications in domain-specific scenarios like autonomous driving and healthcare, where data corruption is prevalent but high-quality performance is critical.
  • Adapting Theoretical Models: Refining theoretical models to better capture the nuanced behavior of DMs under data corruption.

Conclusion

The paper challenges the conventional belief that data corruption necessarily degrades model performance. Instead, it shows that slight data corruption during pre-training can be beneficial, improving the generalization capability and diversity of diffusion models. The proposed Conditional Embedding Perturbation (CEP) technique offers a straightforward yet effective way to harness this phenomenon, leading to better-performing diffusion models across a broad array of applications. This foundational work may influence future research and practical implementations in diffusion modeling and beyond.
