Unified Auto-Encoding with Masked Diffusion

(2406.17688)
Published Jun 25, 2024 in cs.CV and cs.AI

Abstract

At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves on the computational efficiency of prior diffusion-based methods in total training time. We release our code at https://github.com/philippe-eecs/small-vision.

Unified Masked Diffusion: combines random masking with fine-grained noise; targets original-image prediction.

Overview

  • The paper introduces a Unified Masked Diffusion (UMD) framework that integrates diffusion models and masked auto-encoders to excel in both de-noising tasks and representation learning.

  • The proposed UMD method employs a noise-free masked reconstruction step and a mixed masking-and-noising corruption technique, improving computational efficiency without sacrificing effectiveness.

  • Empirical evaluations demonstrate that UMD outperforms traditional baseline methods in various benchmarks, achieving competitive results in representation learning and generative modeling tasks.

Unified Auto-Encoding with Masked Diffusion

The paper "Unified Auto-Encoding with Masked Diffusion" by Hansen-Estruch et al. addresses a longstanding issue in the convergence of generative and self-supervised representation learning. The authors introduce a new framework, termed Unified Masked Diffusion (UMD), that combines two contrasting yet fundamentally similar modeling paradigms: diffusion models and masked auto-encoders. The intent is to develop a single auto-encoding model capable of excelling in both de-noising tasks and representation learning without the overhead costs typically associated with these models.

Methodology

The core innovation of UMD lies in its ability to integrate both patch-based and noise-based corruption techniques. The authors propose modifications to the standard diffusion transformer (DiT) training process, introducing an additional noise-free, high masking representation step in the diffusion noising schedule. Specifically, this approach involves:

  1. Adding a noise-free, high-ratio masked reconstruction step to the diffusion process via a modified variance schedule.
  2. Applying a mix of masking and Gaussian noising at the subsequent timesteps.
  3. Unifying the MAE and diffusion models under a common architecture while retaining computational efficiency.
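The corruption schedule described above can be sketched in NumPy. This is a hypothetical illustration, not the authors' implementation: the function name `umd_corrupt`, the cosine noise schedule, and the specific masking ratios are all assumptions.

```python
import numpy as np

def umd_corrupt(patches, t, T=1000, high_mask_ratio=0.75, mix_mask_ratio=0.5,
                rng=None):
    """Hypothetical sketch of UMD-style corruption (not the authors' code).

    patches: (N, D) array of image patch tokens.
    t: diffusion timestep; t == 0 is the extra noise-free, high-masking step.
    Returns the corrupted patches and the boolean mask of dropped patches.
    """
    rng = np.random.default_rng(rng)
    n = patches.shape[0]
    if t == 0:
        # Added noise-free step: MAE-style high-ratio masking, no noise.
        ratio = high_mask_ratio
        noised = patches.copy()
    else:
        # Subsequent timesteps: lighter masking mixed with scheduled
        # Gaussian noise (a cosine schedule is assumed here).
        ratio = mix_mask_ratio
        alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2
        eps = rng.standard_normal(patches.shape)
        noised = np.sqrt(alpha_bar) * patches + np.sqrt(1 - alpha_bar) * eps
    # Randomly drop a fraction `ratio` of patches (zeroed for the encoder).
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(ratio * n), replace=False)] = True
    noised[mask] = 0.0
    return noised, mask
```

At `t == 0` the visible patches pass through unchanged, so the model sees a pure MAE-style input; at later timesteps the visible patches are themselves noised.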

The UMD framework optimizes concurrently for reconstruction of the masked image patches and denoising of the noised patches, so that a single model both infers representations and generates samples.
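A minimal sketch of such a combined objective, assuming the model predicts the original (clean) patches; the name `umd_objective` and the equal weighting `lam` are assumptions, not the paper's exact loss:

```python
import numpy as np

def umd_objective(pred, target, mask, lam=1.0):
    """Hypothetical combined loss: MSE on masked-patch reconstruction plus
    MSE on denoising the visible (noised) patches, weighted by `lam`."""
    err = (pred - target) ** 2
    mask_loss = err[mask].mean() if mask.any() else 0.0
    denoise_loss = err[~mask].mean() if (~mask).any() else 0.0
    return mask_loss + lam * denoise_loss
```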

Contributions and Results

This work makes two significant contributions:

  1. Self-Supervised Learning Method: UMD successfully aligns masked image model reconstruction and diffusion-based de-noising within a single auto-encoding framework. The approach outperforms traditional MAE and DiT baselines in both computational efficiency and effectiveness in downstream tasks.
  2. Empirical Analysis: Thorough empirical evaluations show strong performance across benchmarks, including ImageNet linear-probing accuracy and class-conditional image generation, underscoring UMD's efficiency and utility across diverse tasks.

Experimental Findings

Representation Learning:

  • Linear probing shows that UMD achieves accuracy competitive with state-of-the-art methods, closely rivaling MAE while substantially reducing total training time.
  • Few-shot transfer learning experiments indicate that UMD captures robust representations conducive to generalizability across multiple out-of-distribution datasets.
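Linear probing itself is simple to illustrate: freeze the encoder, extract features, and fit a linear classifier on top. The closed-form ridge-regression variant below is a hypothetical stand-in for the paper's protocol (which would typically train the probe with SGD):

```python
import numpy as np

def linear_probe(features, labels, l2=1e-3):
    """Fit a linear classifier on frozen encoder features via ridge
    regression against one-hot labels (an illustration, not the exact
    evaluation protocol)."""
    n, d = features.shape
    classes = labels.max() + 1
    Y = np.eye(classes)[labels]                       # one-hot targets
    W = np.linalg.solve(features.T @ features + l2 * np.eye(d),
                        features.T @ Y)               # ridge solution
    return features @ W                               # class scores

def probe_accuracy(features, labels):
    preds = linear_probe(features, labels).argmax(axis=1)
    return (preds == labels).mean()
```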

Generative Performance:

  • For class-conditional image generation, UMD rivals conventional diffusion models such as DiT, obtaining near-equivalent results on metrics such as Fréchet Inception Distance (FID) and Inception Score (IS).
  • UMD achieves favorable computational efficiency, requiring fewer GPU hours compared to its counterparts while maintaining similar or better performance metrics.
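For reference, FID compares Gaussians fitted to Inception features of real and generated images. A minimal NumPy sketch of the standard formula (the trace of the matrix square root is computed via the eigenvalues of the covariance product, which are nonnegative for PSD inputs):

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet Inception Distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) = sum of square roots of eigenvalues of S1 @ S2.
    eigs = np.linalg.eigvals(sigma1 @ sigma2)
    covmean_trace = np.sqrt(np.clip(eigs.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * covmean_trace
```

Identical feature distributions give a score of zero; lower is better.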

Implications and Future Work

UMD's implications span both theoretical and practical domains. The framework provides a clear pathway to combine generative and representational power within a single model, optimizing computation and potentially democratizing access to high-performing models in environments with limited computational resources. The results indicate that de-noising auto-encoders can be tuned to yield both strong generative and discriminative capabilities.

For future work, the authors suggest further exploration into dynamically learning the noise schedules to enhance the model's corruption and reconstruction processes. Additionally, extending the UMD framework to other data modalities beyond images—such as text or audio—could further validate its versatility and performance across broader application domains in AI.

In conclusion, the UMD framework offers substantial strides in resolving the dichotomy between generative modeling and self-supervised learning. This convergence can potentially lead to more efficient and versatile machine learning models, driving advancements in both theoretical methodologies and practical applications of AI.
