Unified Auto-Encoding with Masked Diffusion (2406.17688v1)

Published 25 Jun 2024 in cs.CV and cs.AI

Abstract: At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves over the computational efficiency of prior diffusion based methods in total training time. We release our code at https://github.com/philippe-eecs/small-vision.

Summary

  • The paper introduces UMD, a unified model that concurrently improves self-supervised representation learning and diffusion-based de-noising.
  • It employs a novel integration of patch-based masking and noise-based corruption within a modified diffusion transformer architecture.
  • Empirical results demonstrate competitive ImageNet accuracy and generation metrics while significantly reducing training time.

Unified Auto-Encoding with Masked Diffusion

The paper "Unified Auto-Encoding with Masked Diffusion" by Hansen-Estruch et al. addresses the long-standing divide between generative modeling and self-supervised representation learning. The authors introduce a new framework, termed Unified Masked Diffusion (UMD), that combines two contrasting yet fundamentally similar modeling paradigms: diffusion models and masked auto-encoders. The goal is a single auto-encoding model that excels at both de-noising tasks and representation learning without the overhead typically associated with these models.

Methodology

The core innovation of UMD lies in its ability to integrate both patch-based and noise-based corruption techniques. The authors propose modifications to the standard diffusion transformer (DiT) training process, introducing an additional noise-free, high masking representation step in the diffusion noising schedule. Specifically, this approach involves:

  1. Adding a noise-free masked-reconstruction step to the diffusion process via a modified variance schedule.
  2. Applying a mix of high-ratio patch masking and Gaussian noising at subsequent timesteps.
  3. Unifying MAE and diffusion models under a common architecture that retains computational efficiency.

The UMD framework concurrently optimizes reconstruction of the masked image patches and denoising of the noised patches, so that a single model both infers representations and generates samples.
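The corruption pipeline described above can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the function name, masking ratios, and the linear beta schedule are all assumptions chosen for clarity.

```python
import numpy as np

def umd_corrupt(x, t, T=1000, mask_ratio_repr=0.75, mask_ratio_diff=0.3, rng=None):
    """Hypothetical sketch of a UMD-style corruption step.

    t == 0 plays the role of the extra noise-free, high-masking
    "representation" step; t > 0 mixes a lighter patch mask with the
    usual Gaussian diffusion noise.

    x: patch tokens, shape (num_patches, patch_dim).
    Returns (visible_corrupted_patches, keep_mask).
    """
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    if t == 0:
        # Noise-free MAE-style step: drop a large fraction of patches.
        ratio = mask_ratio_repr
        x_t = x.copy()
    else:
        # Standard DDPM forward noising (assumed linear beta schedule),
        # combined with a lighter patch mask.
        betas = np.linspace(1e-4, 0.02, T)
        alpha_bar = np.cumprod(1.0 - betas)[t - 1]
        eps = rng.standard_normal(x.shape)
        x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
        ratio = mask_ratio_diff
    keep = rng.permutation(n) >= int(n * ratio)  # True = patch stays visible
    return x_t[keep], keep
```

The model would then reconstruct the hidden patches (at t = 0) or predict the injected noise (at t > 0) from the visible corrupted patches.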

Contributions and Results

This work makes two significant contributions:

  1. Self-Supervised Learning Method: UMD successfully aligns masked image model reconstruction and diffusion-based de-noising within a single auto-encoding framework. The approach outperforms traditional MAE and DiT baselines in both computational efficiency and effectiveness in downstream tasks.
  2. Empirical Analysis: Through thorough empirical evaluations, UMD demonstrates strong performance across various benchmarks, including ImageNet linear probing accuracy and class-conditional image generation, underscoring its efficiency and utility across diverse tasks.

Experimental Findings

Representation Learning:

  • Linear probing shows that UMD achieves accuracy competitive with state-of-the-art methods, closely rivaling MAE while substantially reducing training time.
  • Few-shot transfer learning experiments indicate that UMD learns robust representations that generalize across multiple out-of-distribution datasets.
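Linear probing itself is simple to sketch: freeze the encoder, extract features, and fit a linear classifier on top. The closed-form ridge-regression probe below is a generic illustration of the technique, not the paper's exact evaluation protocol (which typically uses a trained logistic-regression head).

```python
import numpy as np

def linear_probe(train_feats, train_labels, test_feats, n_classes, l2=1e-3):
    """Fit a ridge-regression classifier on frozen features and
    return predicted labels for the test features."""
    Y = np.eye(n_classes)[train_labels]  # one-hot targets
    X = train_feats
    # Closed-form ridge solution: W = (X^T X + l2 I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return (test_feats @ W).argmax(axis=1)
```

Because the encoder stays frozen, probe accuracy directly measures how linearly separable the learned representations are.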

Generative Performance:

  • For class-conditional image generation, UMD rivals conventional diffusion models like DiT, obtaining near-equivalent results on metrics such as Fréchet Inception Distance (FID) and Inception Score (IS).
  • UMD achieves favorable computational efficiency, requiring fewer GPU hours compared to its counterparts while maintaining similar or better performance metrics.
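For reference, FID compares Gaussian statistics (mean and covariance) of features extracted from real and generated images, in practice with an Inception-v3 network. The numpy-only sketch below computes the distance from precomputed statistics, using the identity Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2}) to stay with symmetric-PSD square roots.

```python
import numpy as np

def fid_from_stats(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians fitted to feature sets:
    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    def psd_sqrt(m):
        # Matrix square root of a symmetric PSD matrix via eigendecomposition.
        w, v = np.linalg.eigh(m)
        return (v * np.sqrt(np.clip(w, 0, None))) @ v.T
    s1_half = psd_sqrt(sigma1)
    covmean_tr = np.trace(psd_sqrt(s1_half @ sigma2 @ s1_half))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * covmean_tr)
```

Identical feature distributions give FID 0; lower is better.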

Implications and Future Work

UMD's implications span both theoretical and practical domains. The framework provides a clear pathway to combine generative and representational power within a single model, optimizing computation and potentially democratizing access to high-performing models in environments with limited computational resources. The results indicate that de-noising auto-encoders can be tuned to yield both strong generative and discriminative capabilities.

For future work, the authors suggest further exploration into dynamically learning the noise schedules to enhance the model's corruption and reconstruction processes. Additionally, extending the UMD framework to other data modalities beyond images—such as text or audio—could further validate its versatility and performance across broader application domains in AI.

In conclusion, the UMD framework makes substantial strides toward resolving the dichotomy between generative modeling and self-supervised learning. This convergence can lead to more efficient and versatile machine learning models, driving advances in both theoretical methodology and practical applications of AI.
