Emergent Mind

Diffusion Model with Perceptual Loss

(2401.00110)
Published Dec 30, 2023 in cs.CV, cs.AI, and cs.LG

Abstract

Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before.

Overview

  • Diffusion models efficiently create structured data from noise, especially in image generation.

  • Traditional MSE loss falls short in capturing realism, which has prompted reliance on classifier-free guidance.

  • The paper introduces and evaluates a self-perceptual objective, leveraging intrinsic perceptual loss in diffusion models.

  • The self-perceptual objective improves realism for both conditional and unconditional generation without sacrificing sample diversity.

  • The findings show improved sample quality over MSE loss, though for text-to-image generation classifier-free guidance still yields the best overall results.

Introduction to Diffusion Models

Diffusion models are innovative generative models designed to transform random noise into structured and meaningful data, such as images, through a process of denoising. The procedure to create new samples can be thought of as a reverse simulation where noise is incrementally removed to uncover the data representation. These models have achieved remarkable success in image generation and their capabilities extend to other forms of media.
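The reverse denoising procedure described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: `ToyDenoiser` is a hypothetical stand-in for a trained noise-prediction network (a real model would be a U-Net trained on images), and the linear beta schedule is an assumption.

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Hypothetical stand-in for a trained noise-prediction network."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 64), torch.nn.SiLU(), torch.nn.Linear(64, dim)
        )

    def forward(self, x, t):
        # Condition on the (normalized) timestep by simple concatenation.
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x, t_feat], dim=-1))

@torch.no_grad()
def ddpm_sample(model, shape, timesteps=1000):
    """DDPM-style ancestral sampling: start from pure Gaussian noise and
    incrementally denoise, stepping backward through the noise schedule."""
    betas = torch.linspace(1e-4, 0.02, timesteps)  # linear schedule (assumption)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # start from random noise
    for t in reversed(range(timesteps)):
        eps = model(x, torch.full((shape[0],), t))
        # Posterior mean given the model's noise prediction.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        # Add fresh noise at every step except the final one.
        noise = torch.randn(shape) if t > 0 else torch.zeros(shape)
        x = mean + torch.sqrt(betas[t]) * noise
    return x
```

With a trained model, `ddpm_sample(model, (batch, dim))` yields new samples; here the untrained toy network only demonstrates the control flow.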

The standard training objective for diffusion models is a mean squared error (MSE) loss on the predicted noise. While this method is conceptually straightforward, it has shortcomings in producing highly realistic images. To address this, state-of-the-art models have employed techniques like classifier-free guidance, which has been shown to enhance image quality significantly, but the reasons behind its effectiveness were not entirely clear until now.
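The two ingredients mentioned here, the MSE training objective and classifier-free guidance, can both be written down compactly. This is a hedged sketch under the standard epsilon-prediction formulation with an assumed linear noise schedule; the function names are illustrative, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def mse_diffusion_loss(model, x0, timesteps=1000):
    """Standard epsilon-prediction (score-matching) objective with MSE loss:
    noise the data to a random timestep, predict the noise, regress with MSE."""
    betas = torch.linspace(1e-4, 0.02, timesteps)  # linear schedule (assumption)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, timesteps, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alpha_bars[t].unsqueeze(-1)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps  # forward noising
    return F.mse_loss(model(x_t, t), eps)

def cfg_epsilon(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance at sampling time: extrapolate from the
    unconditional prediction toward the conditional one. A scale of 1
    recovers the conditional model; larger scales trade diversity for
    conditioning fidelity."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice the conditional and unconditional predictions come from the same network, trained with the condition randomly dropped, so guidance costs one extra forward pass per sampling step.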

Perceptual Loss and Improved Sample Quality

This paper reveals that the notable performance of classifier-free guidance in producing high-quality samples is due, in part, to its implicit use of perceptual guidance. The idea is to integrate perceptual loss, a measure more aligned with human visual perception, directly into the training of diffusion models. Notably, the diffusion model itself can serve as an effective perceptual network, obviating the need for an external perceptual network. By leveraging this intrinsic capability, the researchers propose a new training objective known as the self-perceptual objective.
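The core mechanism can be sketched as follows: the online model produces a one-step estimate of the clean sample, and a frozen copy of the same diffusion model scores that estimate against the ground truth in its own hidden-feature space rather than in pixel space. Everything below is a toy illustration under stated assumptions: `ToyScoreNet` is a hypothetical MLP denoiser, the feature tap (a single hidden layer) and the re-noising at a fresh timestep are simplified stand-ins for the paper's actual architecture and procedure.

```python
import copy
import torch
import torch.nn.functional as F

class ToyScoreNet(torch.nn.Module):
    """Toy denoiser exposing an intermediate feature map. A real model would
    be a U-Net, and the feature tap a mid-block activation (assumption)."""
    def __init__(self, dim=8):
        super().__init__()
        self.enc = torch.nn.Linear(dim + 1, 64)
        self.dec = torch.nn.Linear(64, dim)

    def forward(self, x, t, return_features=False):
        h = F.silu(self.enc(torch.cat([x, t.float().unsqueeze(-1) / 1000.0], -1)))
        return h if return_features else self.dec(h)

def self_perceptual_loss(model, frozen, x0, timesteps=1000):
    """Sketch of a self-perceptual objective: compare the online model's
    one-step x0 estimate against the ground truth in the feature space of a
    frozen copy of the same diffusion model."""
    betas = torch.linspace(1e-4, 0.02, timesteps)  # linear schedule (assumption)
    abar = torch.cumprod(1.0 - betas, 0)
    b = x0.shape[0]
    # Forward-noise the data and predict the noise with the online model.
    t = torch.randint(0, timesteps, (b,))
    a = abar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    eps_hat = model(x_t, t)
    x0_hat = (x_t - (1 - a).sqrt() * eps_hat) / a.sqrt()  # one-step x0 estimate
    # Re-noise prediction and target identically at a fresh timestep, then
    # compare the frozen model's hidden features instead of raw values.
    t2 = torch.randint(0, timesteps, (b,))
    a2 = abar[t2].unsqueeze(-1)
    eps2 = torch.randn_like(x0)
    with torch.no_grad():
        feat_target = frozen(a2.sqrt() * x0 + (1 - a2).sqrt() * eps2,
                             t2, return_features=True)
    feat_pred = frozen(a2.sqrt() * x0_hat + (1 - a2).sqrt() * eps2,
                       t2, return_features=True)
    return F.mse_loss(feat_pred, feat_target)
```

The frozen copy supplies the perceptual features (its weights receive no gradients), while gradients still flow to the online model through `x0_hat`, which is what lets feature-space errors shape training.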

Advantages of Self-Perceptual Objective

The proposed self-perceptual objective has multiple advantages:

  • It enhances the realism of generated images without compromising diversity for conditional generation, a common issue with classifier-free guidance.
  • Unlike classifier-free guidance, which is specialized for improving conditional models, the self-perceptual objective is also capable of boosting the quality of unconditional models, where no guidance is provided by annotations or labels.
  • It is designed to avoid the limitations of classifier-free guidance, such as overexposure and over-saturation artifacts that can appear at strong guidance scales.

Evaluating the New Objective

The researchers conducted thorough evaluations, including both qualitative and quantitative analyses, across a variety of datasets and conditions. The findings confirm that self-perceptual training offers a meaningful increase in sample quality over the traditional MSE loss. However, in text-to-image generation, classifier-free guidance still produces the best overall images because it enhances alignment with the text prompts at the expense of sample diversity.

In summary, this paper presents a promising direction for training diffusion models with perceptual loss to improve the realism of generated images. The self-perceptual objective acts as a powerful tool for future developments in generative modeling, particularly within the realms of image, video, and audio synthesis.

The implementation provided outlines a PyTorch-based approach to self-perceptual training, showcasing a practical means by which researchers and AI practitioners can apply these findings to their own work.
