
Abstract

Recently, Direct Preference Optimization (DPO) has extended its success from aligning LLMs to aligning text-to-image diffusion models with human preferences. Unlike most existing DPO methods that assume all diffusion steps share a consistent preference order with the final generated images, we argue that this assumption neglects step-specific denoising performance and that preference labels should be tailored to each step's contribution. To address this limitation, we propose Step-aware Preference Optimization (SPO), a novel post-training approach that independently evaluates and adjusts the denoising performance at each step, using a step-aware preference model and a step-wise resampler to ensure accurate step-aware supervision. Specifically, at each denoising step, we sample a pool of images, find a suitable win-lose pair, and, most importantly, randomly select a single image from the pool to initialize the next denoising step. This step-wise resampler process ensures the next win-lose image pair comes from the same image, making the win-lose comparison independent of the previous step. To assess the preferences at each step, we train a separate step-aware preference model that can be applied to both noisy and clean images. Our experiments with Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms the latest Diffusion-DPO in aligning generated images with complex, detailed prompts and enhancing aesthetics, while also being more than 20× faster to train. Code and model: https://rockeycoss.github.io/spo.github.io/

High-quality images produced by SDXL fine-tuned with step-aware preference optimization.

Overview

  • The paper identifies limitations in current Direct Preference Optimization (DPO) methods for text-to-image diffusion models, particularly the misalignment of supervision signals across different denoising steps.

  • It introduces Step-aware Preference Optimization (SPO), a novel approach that independently evaluates and adjusts denoising performance at each step, using a step-aware preference model and a step-wise resampler.

  • Empirical results show significant gains in aligning generated images with detailed prompts and in enhancing their aesthetics, along with a more than 20× reduction in training cost compared to Diffusion-DPO.

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

The paper, authored by Zhanhao Liang et al., proposes Step-aware Preference Optimization (SPO), a method for fine-tuning text-to-image diffusion models. This work addresses the limitations of current Direct Preference Optimization (DPO) methods when applied to such models. Specifically, the authors point out that existing DPO methods assign a single, trajectory-level preference label to all intermediate generation steps, which overlooks step-specific denoising performance.
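For context, the preference objective that trajectory-level methods build on is the standard DPO loss shown below; in Diffusion-DPO, the same winner/loser pair supplies the label when every denoising timestep is optimized, which is the assumption SPO relaxes. The notation here is generic (policy π_θ, reference π_ref, temperature β, prompt c), not the paper's exact step-level formulation.

```latex
% Standard DPO objective over a preferred/dispreferred pair (x^w, x^l) for prompt c.
% Trajectory-level Diffusion-DPO reuses this single (x^w, x^l) label when updating
% every denoising timestep; SPO instead derives a fresh win-lose pair at each step.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(c,\, x^w,\, x^l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(x^w \mid c)}{\pi_{\mathrm{ref}}(x^w \mid c)}
      \;-\;
      \beta \log \frac{\pi_\theta(x^l \mid c)}{\pi_{\mathrm{ref}}(x^l \mid c)}
    \right)
  \right]
```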

Summary of Contributions

The primary contributions of the paper are as follows:

  1. Identification of Misalignment in DPO: The authors noted that diffusion models generate images through multiple denoising steps, each with distinct contributions. Existing DPO methods assume a consistent preference label across all steps, leading to suboptimal performance due to misaligned supervision signals.
  2. Introduction of SPO: To address this misalignment, the authors introduced SPO, a novel post-training approach. SPO independently evaluates and adjusts denoising performance at each step using a step-aware preference model and a step-wise resampler.
  3. Step-aware Preference Model: The paper proposed a step-aware preference model that assesses the quality of images at each denoising step by taking timestep-specific conditions into account. This model is trained to predict preferences for noisy intermediate steps, aligning preferences more accurately with the denoising performance.
  4. Step-wise Resampler: The authors designed a step-wise resampler process that ensures win-lose comparisons are independent of previous steps. This mechanism refines the training process by preventing trajectory-level dependencies (see the sketch after this list).
  5. Empirical Validation: The empirical results demonstrated significant improvements in aligning generated images with detailed prompts and enhancing their aesthetics. SPO exhibited over 20× improvement in training efficiency compared to the latest Diffusion-DPO methods, establishing a notable advancement in both fine-tuning performance and efficiency.
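The interaction between contributions 3 and 4 can be summarized as one sampling step. The sketch below is an illustrative reconstruction, not the authors' code: `denoise_fn`, `step_preference_fn`, and `pool_size` are hypothetical placeholders for the stochastic sampler, the step-aware preference model, and the candidate-pool size.

```python
import random

def spo_rollout_step(x_t, t, prompt_emb, denoise_fn, step_preference_fn, pool_size=4):
    """One step of step-wise resampling (a minimal sketch, not the paper's implementation).

    denoise_fn(x_t, t, prompt_emb)        -> one candidate x_{t-1} (stochastic sampler step)
    step_preference_fn(x, t, prompt_emb)  -> scalar preference score for a noisy sample at step t
    Both callables are hypothetical placeholders.
    """
    # 1. Sample a pool of candidate denoised latents, all starting from the same x_t.
    candidates = [denoise_fn(x_t, t, prompt_emb) for _ in range(pool_size)]

    # 2. Score each candidate with the step-aware preference model.
    scores = [step_preference_fn(x, t, prompt_emb) for x in candidates]

    # 3. Pick a win-lose pair from the pool for the step-level preference update.
    win = candidates[max(range(pool_size), key=lambda i: scores[i])]
    lose = candidates[min(range(pool_size), key=lambda i: scores[i])]

    # 4. Randomly pick ONE candidate to initialize the next step, so the next
    #    comparison is not biased toward the current winner (step-wise resampling).
    x_next = random.choice(candidates)

    return (win, lose), x_next
```

The random carry-forward in step 4 is what decouples consecutive comparisons: the winner at step t is used only for that step's preference update, not necessarily as the starting point of the next step.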

Experimental Results

The evaluation was conducted on Stable Diffusion v1.5 (SD-1.5) and SDXL. The improvements were validated with four AI feedback models: PickScore, HPSv2, ImageReward, and an aesthetic score predictor. The results highlighted that:

  • In the SDXL evaluation, SPO consistently outperformed Diffusion-DPO across all metrics.
  • The SD-1.5 evaluation showed similar gains, with SPO achieving the best performance across all four feedback models.

Moreover, a user study confirmed that images generated with SPO were preferred over those generated by Diffusion-DPO, particularly in terms of visual appeal and general preference.
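For readers reproducing this kind of automatic evaluation, the snippet below shows one common way to compute a PickScore for a prompt-image pair using the publicly released checkpoint on Hugging Face. This is a minimal sketch of generic PickScore usage, not the paper's evaluation pipeline; the model and processor IDs are those published with PickScore rather than anything specified in this summary.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Publicly released PickScore reward model (a CLIP-style text-image scorer).
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval()

def pickscore(prompt: str, image: Image.Image) -> float:
    """Return the PickScore preference score of `image` for `prompt`."""
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**image_inputs)
        txt_emb = model.get_text_features(**text_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        # Cosine similarity scaled by the learned logit scale, as in CLIP.
        score = model.logit_scale.exp() * (txt_emb @ img_emb.T)
    return score.item()

# Example: score the same prompt rendered by two fine-tuned models and compare.
# score_a = pickscore(prompt, image_from_model_a)
# score_b = pickscore(prompt, image_from_model_b)
```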

Implications and Future Directions

The approach presented in this paper offers significant implications for both practical and theoretical advancements in AI-driven image generation:

  • Practical Implications: SPO’s ability to align the denoising performance with human preferences on a step-by-step basis can be instrumental in applications requiring high-fidelity image generation from complex textual descriptions. This has direct applications in content creation, visual storytelling, and graphic design.
  • Theoretical Implications: The findings underline the importance of considering step-wise contributions in multi-stage generation processes. This may prompt further research into more granular optimization techniques, not just in diffusion models but extending to other multi-step generative tasks.
  • Efficiency Gains: The demonstrated efficiency of SPO—achieving the same or better quality with significantly reduced training time—suggests that similar step-aware approaches could be applied to other costly training problems in AI, potentially catalyzing broader use and accessibility.

Speculations for Future Research

Future developments might explore the integration of SPO with other state-of-the-art models, or its application in different domains such as video generation or multi-modal generation tasks. Moreover, investigating the extension of SPO's principles to reinforcement learning scenarios could provide insights into optimizing agent behaviors over multiple decision-making steps.

By incorporating step-specific supervision, this research paves the way for more nuanced, efficient, and user-aligned generative models, expanding the horizons of what AI can achieve in creative domains.
