
Diffusion Model with Perceptual Loss (2401.00110v7)

Published 30 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Diffusion models without guidance generate very unrealistic samples. Guidance is used ubiquitously, and previous research has attributed its effect to low-temperature sampling that improves quality by trading off diversity. However, this perspective is incomplete. Our research shows that the choice of the loss objective is the underlying reason raw diffusion models fail to generate desirable samples. In this paper, (1) our analysis shows that the loss objective plays an important role in shaping the learned distribution and the MSE loss derived from theories holds assumptions that misalign with data in practice; (2) we explain the effectiveness of guidance methods from a new perspective of perceptual supervision; (3) we validate our hypothesis by training a diffusion model with a novel self-perceptual loss objective and obtaining much more realistic samples without the need for guidance. We hope our work paves the way for future explorations of the diffusion loss objective.


Summary

  • The paper proposes integrating a self-perceptual loss into diffusion models, replacing conventional MSE to enhance image realism.
  • The methodology leverages the model’s latent space and noise scheduling to compute perceptual loss, leading to improved FID and Inception scores.
  • Extensive ablation studies validate that the self-perceptual approach boosts performance in unconditional generation without relying on external guidance.

Diffusion Model with Perceptual Loss

Introduction

The paper presents an enhancement to diffusion models for generative tasks by incorporating a self-perceptual loss in place of the conventional mean squared error (MSE) loss. Diffusion models have emerged as a popular class of generative models that transform noise into data samples via an iterative denoising process. Training traditionally uses MSE loss, which often results in unrealistic image samples unless guidance is applied. The paper argues that a perceptual loss derived from the diffusion model itself can yield more realistic outputs, bypassing the need for external perceptual networks.

Diffusion Models and Perceptual Loss

Diffusion models are generally parameterized as neural networks trained with MSE loss to minimize the difference between model predictions and ground truth. However, because the underlying data distribution is ambiguous given finite training data (Figure 1), these models struggle to generate high-quality outputs on their own. The paper highlights the limitations of classifier-free guidance, a popular technique for enhancing sample quality, and proposes using the model itself to supply a perceptual loss, which can be applied even to unconditional generation models, something classifier-free guidance cannot do.

Figure 1: The underlying data distribution is ambiguous given finite training data.
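As background for the discussion above, the conventional objective the paper critiques can be sketched in a few lines. This is a toy numpy sketch under the usual forward-process parameterization; `predict_noise` is a hypothetical placeholder, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(x_t, t):
    """Hypothetical stand-in for a denoising network's noise prediction."""
    return x_t  # placeholder only; a real model would be a trained network

def mse_diffusion_loss(x0, alpha_bar_t):
    """Standard denoising objective: MSE between true and predicted noise,
    where x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    eps_hat = predict_noise(x_t, alpha_bar_t)
    return float(np.mean((eps_hat - eps) ** 2))
```

The key point is that the loss is an elementwise distance on noise (equivalently, on pixels or latents), which is exactly the assumption the paper argues misaligns with perceptual quality.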

Self-Perceptual Objective

The novel self-perceptual objective uses the diffusion model itself as the perceptual network, exploiting the model's own learned representations to improve the quality of generated samples. By measuring error in this feature space rather than in pixel space, the approach aligns better with human perceptual judgments than MSE, and it leverages the model's latent space and noise levels during training to backpropagate the perceptual loss, enhancing output quality without sacrificing diversity.
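The difference between a pixel-space and a feature-space loss can be illustrated with a minimal sketch. Here a fixed random projection `features` stands in for the frozen diffusion model's intermediate activations; `W` and `features` are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Hypothetical frozen feature extractor standing in for the frozen
# diffusion model's intermediate activations (the "perceptual network").
W = np.random.default_rng(1).standard_normal((8, 8))

def features(x):
    return np.tanh(x @ W)  # a fixed nonlinear embedding, for illustration

def pixel_mse(pred, target):
    """Conventional loss: elementwise distance in pixel/latent space."""
    return float(np.mean((pred - target) ** 2))

def self_perceptual_loss(pred, target):
    """Perceptual-style loss: distance between frozen-model features."""
    return float(np.mean((features(pred) - features(target)) ** 2))
```

Both losses vanish when prediction and target coincide, but the feature-space loss weights errors by how the (frozen) model represents them rather than uniformly per pixel.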

Methodology

Each training step samples image latents, noise, and timesteps according to a defined diffusion schedule. The perceptual objective is computed with a frozen copy of a diffusion model pre-trained with the traditional MSE loss. The paper provides pseudo-code implementations to streamline integration into existing diffusion frameworks, aiding reproducibility across model architectures.

Evaluation and Results

Quantitative results demonstrate improvements in Fréchet Inception Distance (FID) and Inception Score (IS), showcasing better alignment with perceptual quality standards than conventional methods. Qualitative assessments further exhibit greater realism and visual appeal under the proposed methodology (Figure 2).

Figure 2: Unconditional generation. Both use DDIM 1000 steps with the same seed. Our self-perceptual objective can improve unconditional generation quality. This was previously not possible with classifier-free guidance because it only works for conditional models.
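For reference, FID is the Fréchet distance between Gaussians fitted to Inception features of real and generated images. A minimal numpy sketch of that distance (the Inception feature extraction itself is omitted; `sqrtm_psd` is a helper introduced here for illustration):

```python
import numpy as np

def sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2)),
    using Tr((s1 s2)^(1/2)) = Tr((s1^(1/2) s2 s1^(1/2))^(1/2))."""
    diff = mu1 - mu2
    s1_half = sqrtm_psd(sigma1)
    cross = sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(diff @ diff + np.trace(sigma1 + sigma2) - 2.0 * np.trace(cross))
```

The distance is zero for identical feature distributions and grows as the generated-feature statistics drift from the real ones, which is why lower FID indicates better perceptual fidelity.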

Ablation Studies

Extensive ablation studies identify optimal configurations, including which layers of the frozen model contribute most to the perceptual loss and which timestep selections yield the best features, demonstrating that the perceptual loss achieves a strong quality-diversity balance across the tested parameters.

Conclusions

The research demonstrates the effectiveness of integrating self-perceptual objectives into diffusion training, enhancing sample quality, particularly for unconditional models. Although classifier-free guidance still surpasses the method in text-alignment scenarios, the approach sets a precedent for exploiting the internal perceptual capacities of generative models, potentially informing future architecture designs, sampling strategies, and loss function optimizations.

The exploration of perceptual loss directly harnessed from diffusion models offers a promising avenue for advancing generative model quality, with implications for broader applications across image, video, and audio modalities. The ability to enhance outputs without additional guidance mechanisms marks a significant step in generative model advancements. The findings encourage further exploration of model-inherent properties to guide training dynamics and output quality, potentially reshaping approaches to diffusion-based generative modeling.
