
Abstract

Recently, Direct Preference Optimization (DPO) has extended its success from aligning LLMs to aligning text-to-image diffusion models with human preferences. Unlike most existing DPO methods that assume all diffusion steps share a consistent preference order with the final generated images, we argue that this assumption neglects step-specific denoising performance and that preference labels should be tailored to each step's contribution. To address this limitation, we propose Step-aware Preference Optimization (SPO), a novel post-training approach that independently evaluates and adjusts the denoising performance at each step, using a step-aware preference model and a step-wise resampler to ensure accurate step-aware supervision. Specifically, at each denoising step, we sample a pool of images, find a suitable win-lose pair, and, most importantly, randomly select a single image from the pool to initialize the next denoising step. This step-wise resampler process ensures the next win-lose image pair comes from the same image, making the win-lose comparison independent of the previous step. To assess the preferences at each step, we train a separate step-aware preference model that can be applied to both noisy and clean images. Our experiments with Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms the latest Diffusion-DPO in aligning generated images with complex, detailed prompts and enhancing aesthetics, while also being more than 20× faster to train. Code and model: https://rockeycoss.github.io/spo.github.io/

High-quality images produced by SDXL fine-tuned with step-aware preference optimization.

Overview

  • The paper identifies limitations in current Direct Preference Optimization (DPO) methods for text-to-image diffusion models, particularly the misalignment of supervision signals across different denoising steps.

  • It introduces Step-aware Preference Optimization (SPO), a novel approach that independently evaluates and adjusts denoising performance at each step, using a step-aware preference model and a step-wise resampler.

  • Empirical results show significant gains in aligning generated images with detailed prompts and in enhancing their aesthetics, along with a more than 20× reduction in training cost compared to Diffusion-DPO.

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

The paper, authored by Zhanhao Liang et al., proposes Step-aware Preference Optimization (SPO), a method for fine-tuning text-to-image diffusion models. This work addresses the limitations of current Direct Preference Optimization (DPO) methods when applied to such models. Specifically, the authors point out that existing DPO methods assign a single, trajectory-level preference label to all intermediate generation steps, which overlooks step-specific denoising performance.
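For context, the preference objective that trajectory-level methods build on is the standard DPO loss shown below; in Diffusion-DPO, the same winner/loser pair supplies the label when every denoising timestep is optimized, which is the assumption SPO relaxes. The notation here is generic (policy π_θ, reference π_ref, temperature β, prompt c), not the paper's exact step-level formulation.

```latex
% Standard DPO objective over a preferred/dispreferred pair (x^w, x^l) for prompt c.
% Trajectory-level Diffusion-DPO reuses this single (x^w, x^l) label when updating
% every denoising timestep; SPO instead derives a fresh win-lose pair at each step.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(c,\, x^w,\, x^l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(x^w \mid c)}{\pi_{\mathrm{ref}}(x^w \mid c)}
      \;-\;
      \beta \log \frac{\pi_\theta(x^l \mid c)}{\pi_{\mathrm{ref}}(x^l \mid c)}
    \right)
  \right]
```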

Summary of Contributions

The primary contributions of the paper are as follows:

  1. Identification of Misalignment in DPO: The authors noted that diffusion models generate images through multiple denoising steps, each with distinct contributions. Existing DPO methods assume a consistent preference label across all steps, leading to suboptimal performance due to misaligned supervision signals.
  2. Introduction of SPO: To address this misalignment, the authors introduced SPO, a novel post-training approach. SPO independently evaluates and adjusts denoising performance at each step using a step-aware preference model and a step-wise resampler.
  3. Step-aware Preference Model: The paper proposed a step-aware preference model that assesses the quality of images at each denoising step by taking timestep-specific conditions into account. This model is trained to predict preferences for noisy intermediate steps, aligning preferences more accurately with the denoising performance.
  4. Step-wise Resampler: The authors designed a step-wise resampler process that ensures win-lose comparisons are independent of previous steps. This mechanism refines the training process by preventing trajectory-level dependencies (see the sketch after this list).
  5. Empirical Validation: The empirical results demonstrated significant improvements in aligning generated images with detailed prompts and enhancing their aesthetics. SPO exhibited over 20× improvement in training efficiency compared to the latest Diffusion-DPO methods, establishing a notable advancement in both fine-tuning performance and efficiency.
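The interaction between contributions 3 and 4 can be summarized as one sampling step. The sketch below is an illustrative reconstruction, not the authors' code: `denoise_fn`, `step_preference_fn`, and `pool_size` are hypothetical placeholders for the stochastic sampler, the step-aware preference model, and the candidate-pool size.

```python
import random

def spo_rollout_step(x_t, t, prompt_emb, denoise_fn, step_preference_fn, pool_size=4):
    """One step of step-wise resampling (a minimal sketch, not the paper's implementation).

    denoise_fn(x_t, t, prompt_emb)        -> one candidate x_{t-1} (stochastic sampler step)
    step_preference_fn(x, t, prompt_emb)  -> scalar preference score for a noisy sample at step t
    Both callables are hypothetical placeholders.
    """
    # 1. Sample a pool of candidate denoised latents, all starting from the same x_t.
    candidates = [denoise_fn(x_t, t, prompt_emb) for _ in range(pool_size)]

    # 2. Score each candidate with the step-aware preference model.
    scores = [step_preference_fn(x, t, prompt_emb) for x in candidates]

    # 3. Pick a win-lose pair from the pool for the step-level preference update.
    win = candidates[max(range(pool_size), key=lambda i: scores[i])]
    lose = candidates[min(range(pool_size), key=lambda i: scores[i])]

    # 4. Randomly pick ONE candidate to initialize the next step, so the next
    #    comparison is not biased toward the current winner (step-wise resampling).
    x_next = random.choice(candidates)

    return (win, lose), x_next
```

The random carry-forward in step 4 is what decouples consecutive comparisons: the winner at step t is used only for that step's preference update, not necessarily as the starting point of the next step.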

Experimental Results

The evaluation was conducted on Stable Diffusion v1.5 (SD-1.5) and SDXL. The improvements were validated with four AI feedback models: PickScore, HPSv2, ImageReward, and an aesthetic score predictor. The results highlighted that:

  • In the SDXL evaluation, SPO consistently outperformed Diffusion-DPO across all metrics.
  • The SD-1.5 evaluation showed similar gains, with SPO achieving the best performance across all four feedback models.

Moreover, a user study confirmed that images generated with SPO were preferred over those generated by Diffusion-DPO, particularly in terms of visual appeal and general preference.
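For readers reproducing this kind of automatic evaluation, the snippet below shows one common way to compute a PickScore for a prompt-image pair using the publicly released checkpoint on Hugging Face. This is a minimal sketch of generic PickScore usage, not the paper's evaluation pipeline; the model and processor IDs are those published with PickScore rather than anything specified in this summary.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Publicly released PickScore reward model (a CLIP-style text-image scorer).
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval()

def pickscore(prompt: str, image: Image.Image) -> float:
    """Return the PickScore preference score of `image` for `prompt`."""
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**image_inputs)
        txt_emb = model.get_text_features(**text_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        # Cosine similarity scaled by the learned logit scale, as in CLIP.
        score = model.logit_scale.exp() * (txt_emb @ img_emb.T)
    return score.item()

# Example: score the same prompt rendered by two fine-tuned models and compare.
# score_a = pickscore(prompt, image_from_model_a)
# score_b = pickscore(prompt, image_from_model_b)
```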

Implications and Future Directions

The approach presented in this paper offers significant implications for both practical and theoretical advancements in AI-driven image generation:

  • Practical Implications: SPO’s ability to align the denoising performance with human preferences on a step-by-step basis can be instrumental in applications requiring high-fidelity image generation from complex textual descriptions. This has direct applications in content creation, visual storytelling, and graphic design.
  • Theoretical Implications: The findings underline the importance of considering step-wise contributions in multi-stage generation processes. This may prompt further research into more granular optimization techniques, not just in diffusion models but extending to other multi-step generative tasks.
  • Efficiency Gains: The demonstrated efficiency of SPO—achieving the same or better quality with significantly reduced training time—suggests that similar step-aware approaches could be applied to other costly training problems in AI, potentially catalyzing broader use and accessibility.

Speculations for Future Research

Future developments might explore the integration of SPO with other state-of-the-art models, or its application in different domains such as video generation or multi-modal generation tasks. Moreover, investigating the extension of SPO's principles to reinforcement learning scenarios could provide insights into optimizing agent behaviors over multiple decision-making steps.

By incorporating step-specific supervision, this research paves the way for more nuanced, efficient, and user-aligned generative models, expanding the horizons of what AI can achieve in creative domains.
