Image Neural Field Diffusion Models

(arXiv:2406.07480)
Published Jun 11, 2024 in cs.CV

Abstract

Diffusion models have shown an impressive ability to model complex data distributions, with several key advantages over GANs, such as stable training, better coverage of the training distribution's modes, and the ability to solve inverse problems without extra training. However, most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields, which can be rendered at any resolution, and show its advantages over fixed-resolution models. To achieve this, a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method, inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets, outperform fixed-resolution diffusion models followed by super-resolution models, and can solve inverse problems with conditions applied at different scales efficiently.

Figure: Comparison showing finer, high-frequency textures from the method fine-tuned from Stable Diffusion versus a super-resolution baseline.

Overview

  • The paper introduces Image Neural Field Diffusion Models (INFD), a novel framework for high-resolution image synthesis using image neural fields.

  • INFD overcomes the limitations of traditional fixed-resolution models by allowing image generation at arbitrary resolutions without the need for separate super-resolution processes.

  • The paper reports significant improvements in high-resolution generation, outperforming fixed-resolution diffusion models paired with super-resolution models on benchmarks such as FFHQ.

Image Neural Field Diffusion Models

The paper "Image Neural Field Diffusion Models" by Yinbo Chen et al. presents an innovative framework for improving image synthesis using diffusion models trained on image neural fields. This model builds on the strengths of diffusion models and introduces several key modifications to enhance resolution-agnostic image generation.

The core motivation for this work stems from the limitations of fixed-resolution image synthesis techniques, which often require cumbersome upsampling processes to achieve high-resolution outputs. Traditional latent diffusion models (LDMs) operate on fixed-resolution images, necessitating separate super-resolution models for generating high-resolution images. However, this approach can introduce artifacts and degrade image quality. In contrast, the authors propose a method that leverages image neural fields to represent images continuously, allowing for generation at arbitrary resolutions without the need for additional super-resolution steps.

Methodology

The proposed Image Neural Field Diffusion Models (INFD) build upon the latent diffusion framework. The process is divided into two primary stages:

  1. Training an Image Neural Field Autoencoder:

    • An encoder is first trained to map images to latent representations.
    • A decoder then converts these latent representations back into feature maps, which are rendered into images using a Convolutional Local Image Function (CLIF) network.
    • CLIF is crucial for generating high-quality images at variable resolutions, addressing shortcomings of the Local Implicit Image Function (LIIF) used in previous approaches (a minimal rendering sketch follows this list).
    • The autoencoder training employs a combination of L1 loss, perceptual loss, and GAN loss, with multi-scale patches to provide supervision at varying resolutions, enhancing the realism of generated images.
  2. Training a Latent Diffusion Model:

    • The latent space learned in the first stage forms the basis for training a diffusion model.
    • Standard diffusion training and sampling are applied in the latent space, leveraging the learned representation to generate new samples (a minimal training-loss sketch also follows this list).
    • Operating in the compact latent space significantly reduces computational cost while still allowing high-resolution synthesis, since the renderer produces the final pixels at any target size.
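To make stage 1 concrete, below is a minimal PyTorch sketch of a CLIF-style renderer. This is not the authors' implementation: the class name CLIFRenderer, the network depth, and the exact inputs are assumptions based on the paper's description (sampled local features, the offset to the nearest feature center, and a per-pixel cell size, decoded by a small ConvNet over the whole query grid rather than LIIF's per-pixel MLP).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_coord_grid(h, w, device=None):
    """Pixel-center coordinates in [-1, 1] for an h-by-w render target."""
    ys = (torch.arange(h, device=device, dtype=torch.float32) + 0.5) / h * 2 - 1
    xs = (torch.arange(w, device=device, dtype=torch.float32) + 0.5) / w * 2 - 1
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1)  # (h, w, 2), x-then-y as grid_sample expects

class CLIFRenderer(nn.Module):
    """Hypothetical CLIF-style renderer (names and sizes are assumptions).

    Samples local features at query coordinates, concatenates the offset to
    the nearest feature center and the output cell size, then decodes the
    whole query grid with a small ConvNet instead of a per-pixel MLP."""

    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim + 4, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),
        )

    def forward(self, feat, out_hw):
        b, _, fh, fw = feat.shape
        h, w = out_hw
        grid = make_coord_grid(h, w, feat.device).unsqueeze(0).expand(b, -1, -1, -1)
        # Nearest local feature vector for every output pixel.
        sampled = F.grid_sample(feat, grid, mode="nearest", align_corners=False)
        # Offset from each query point to that feature vector's center.
        centers = make_coord_grid(fh, fw, feat.device).permute(2, 0, 1)
        centers = centers.unsqueeze(0).expand(b, -1, -1, -1)
        nearest = F.grid_sample(centers, grid, mode="nearest", align_corners=False)
        rel = grid.permute(0, 3, 1, 2) - nearest  # (b, 2, h, w)
        # Cell size: footprint of one output pixel in [-1, 1] coordinates.
        cell = torch.tensor([2.0 / w, 2.0 / h], device=feat.device)
        cell = cell.view(1, 2, 1, 1).expand(b, 2, h, w)
        return self.net(torch.cat([sampled, rel, cell], dim=1))  # (b, 3, h, w)
```

Usage would look like rgb = CLIFRenderer(feat_dim=64)(features, (512, 512)); because the decoder is convolutional over the query grid, rendered crops can be supervised with the patch-based GAN and perceptual losses described above.

Stage 2 is then standard denoising training in the frozen autoencoder's latent space. A minimal epsilon-prediction sketch, assuming a DDPM-style linear noise schedule; encoder and denoiser are placeholders, not the paper's exact networks:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

def diffusion_loss(denoiser, encoder, images):
    """Epsilon-prediction loss on latents from the frozen stage-1 encoder."""
    with torch.no_grad():
        z0 = encoder(images)                    # (b, c, h', w') latents
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alphas_bar.to(z0.device)[t].view(-1, 1, 1, 1)
    zt = ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps   # forward process q(z_t | z_0)
    return F.mse_loss(denoiser(zt, t), eps)
```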

Key Advantages

The paper highlights several advantages of this approach over traditional fixed-resolution models:

  • Mixed-Resolution Training: INFD can handle datasets with images of varying resolutions without downsampling, leveraging mixed-resolution datasets for more effective learning.
  • High-Resolution Synthesis: Unlike fixed-resolution models that rely on separate upsampling processes, INFD can generate high-resolution images directly from the latent space representations, ensuring higher quality and more coherent details.
  • Efficient Inverse Problem Solving: The resolution-agnostic nature of the model allows inverse problems with conditions defined at multiple scales to be solved efficiently (see the guidance sketch below).
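To illustrate the last point, here is a schematic sketch, in the spirit of diffusion posterior sampling, of applying a condition at an arbitrary render scale. This is not the paper's exact procedure; denoiser, render (e.g., the stage-1 decoder plus the CLIF-style renderer sketched above), and step_size are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def guided_latent(denoiser, render, zt, t, ab_t, y, cond_hw, step_size=1.0):
    """One guidance correction: render the current clean-latent estimate at
    the condition's resolution and nudge z_t toward agreement with y.
    ab_t is the cumulative alpha for step t, broadcastable over zt."""
    zt = zt.detach().requires_grad_(True)
    eps = denoiser(zt, t)
    z0_hat = (zt - (1.0 - ab_t).sqrt() * eps) / ab_t.sqrt()  # Tweedie estimate
    pred = render(z0_hat, cond_hw)        # neural field rendered at y's scale
    loss = F.mse_loss(pred, y)            # data-fidelity term
    grad, = torch.autograd.grad(loss, zt)
    return (zt - step_size * grad).detach()  # corrected latent for this step
```

Because the field can be rendered at any resolution, conditions defined at different scales reuse the same loop with a different cond_hw.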

Numerical Results

The paper provides strong numerical results demonstrating the effectiveness of INFD. For instance, INFD significantly outperforms LDM combined with state-of-the-art super-resolution models on the FFHQ dataset, achieving lower patch FID (pFID) scores across different resolutions. On the Mountains dataset, INFD generates high-resolution detail without the artifacts typically seen in GAN-based baselines.

Implications and Future Directions

Practically, INFD can be expected to simplify workflows that require high-resolution image synthesis, such as medical imaging, remote sensing, and digital art. Theoretically, this work pushes the boundary of generative modeling by integrating the continuous representation of images directly into the diffusion process, paving the way for more sophisticated and flexible generative models.

Future developments might include extending this approach to video synthesis, where temporal consistency across frames is crucial. Another promising direction is integration with text-to-image models such as Stable Diffusion, which the paper demonstrates qualitatively, to further enhance the fidelity and diversity of generated images across various domains.

In summary, this paper provides a robust framework that substantially enhances the capabilities of latent diffusion models to generate high-resolution, photorealistic images efficiently. The integration of image neural fields into diffusion models represents a significant step forward in generative modeling, offering both practical and theoretical advancements.
