Emergent Mind

Abstract

Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from "reward hacking" and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often as the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-$\alpha$, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time. Code is available at https://github.com/ExplainableML/ReNO.

Comparison of four Text-to-Image models with and without ReNO for different prompts.

Overview

  • The paper introduces ReNO, a novel approach for optimizing Text-to-Image (T2I) models by using human preference reward models to optimize initial noise.

  • ReNO improves image quality and prompt adherence without changing model parameters, leveraging reward models like ImageReward, PickScore, HPSv2, and CLIPScore.

  • Empirical results show that ReNO substantially improves performance on the T2I-CompBench and GenEval benchmarks, demonstrating practical applicability and motivating further work, including the integration of safety and fairness objectives into reward models.

Enhancing Text-to-Image Models through Reward-based Noise Optimization

The paper "ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization" introduces a novel approach to improving Text-to-Image (T2I) models by optimizing the initial noise based on human preference reward models. While T2I models have made impressive progress, they still struggle to capture intricate details in complex compositional prompts. Reward-based Noise Optimization (ReNO) addresses these challenges, delivering substantial improvements without altering model parameters. The method combines four reward models—ImageReward, PickScore, HPSv2, and CLIPScore—leveraging their complementary strengths to robustly guide T2I models toward high-quality, prompt-aligned images.
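To illustrate the kind of differentiable signal such reward models provide, here is a simplified, hypothetical CLIPScore-style reward: the scaled cosine similarity between an image embedding and a text embedding. The function name and `scale` parameter are illustrative assumptions; the actual CLIPScore metric and learned preference models such as ImageReward are more involved.

```python
import torch
import torch.nn.functional as F

def clip_style_reward(image_emb: torch.Tensor, text_emb: torch.Tensor,
                      scale: float = 100.0) -> torch.Tensor:
    """Toy CLIPScore-like reward: scaled cosine similarity of embeddings.

    `image_emb` and `text_emb` are assumed to be (batch, dim) outputs of an
    image encoder and a text encoder; this is a simplification, not the
    exact CLIPScore formula.
    """
    image_emb = F.normalize(image_emb, dim=-1)  # project onto the unit sphere
    text_emb = F.normalize(text_emb, dim=-1)
    return scale * (image_emb * text_emb).sum(dim=-1)  # scaled cosine similarity
```

Because such a reward is differentiable with respect to the image, its gradient can flow back through the generator to the initial noise, which is the property ReNO exploits.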

Contribution and Methodology

The primary contribution is the introduction of ReNO, which optimizes the initial noise vector at inference time to improve image quality and prompt adherence without significant computational overhead. The authors critically assess reward-based fine-tuning approaches, highlighting limitations such as reward hacking and high computational cost; ReNO sidesteps these issues by optimizing only the initial noise, leaving the model weights untouched.

The methodology involves:

  1. One-step Diffusion Models: Utilizing well-distilled one-step T2I models to maintain computational efficiency.
  2. Reward-Based Noise Optimization: Iteratively refining the initial noise vector through gradient ascent, leveraging signals from multiple reward models to improve image generation.
  3. Noise Regularization: Ensuring that the optimized noise remains within a reasonable distribution to prevent collapse and preserve semantic integrity.
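The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` stands in for a one-step T2I model, `rewards` and `weights` for the reward ensemble, and the variance penalty is one simple way to keep the noise in-distribution (the paper's regularizer may differ). Adam is used here for stable ascent steps, an implementation choice.

```python
import torch

def reno_optimize(generate, rewards, weights, prompt,
                  n_iters=50, lr=0.05, reg_strength=0.01,
                  noise_shape=(1, 4, 64, 64)):
    """Sketch of reward-based noise optimization (hypothetical, not the authors' code)."""
    noise = torch.randn(noise_shape, requires_grad=True)  # initial noise, the only optimized variable
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(n_iters):
        image = generate(noise, prompt)  # single forward pass of a one-step model
        # Weighted combination of reward signals (e.g. ImageReward, PickScore, ...)
        reward = sum(w * r(image, prompt) for w, r in zip(weights, rewards))
        # Penalize drift of the noise variance away from 1 to stay near N(0, I)
        reg = (noise.pow(2).mean() - 1.0).pow(2)
        loss = -reward + reg_strength * reg  # minimizing loss = gradient ascent on reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```

The key design choice is that only `noise` carries gradients: the generator and reward models are frozen, so the optimization adds no training cost and generalizes to any prompt at inference time.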

Numerical Results and Implications

The results on the T2I-CompBench and GenEval benchmarks show substantial improvements. For instance, applying ReNO to SD-Turbo yields gains of more than 20 percentage points in categories such as Color and Texture, bringing performance close to, or beyond, proprietary models such as DALL-E 3. Within a computational budget of 20-50 seconds per image, ReNO remains practical, making it an efficient solution even for high-demand scenarios.

Furthermore, user studies on Parti-Prompts confirm that ReNO-enhanced models are preferred for both image aesthetics and prompt faithfulness, suggesting a balanced enhancement that addresses both quantitative metrics and subjective user preferences.

Theoretical and Practical Implications

The research elucidates the critical role of initial noise in T2I models and presents a robust framework for leveraging reward models. Theoretically, it raises important questions about the distribution and manipulation of noise in generative models, opening avenues for further exploration of inference-time optimization.

Practically, ReNO's efficiency and effectiveness suggest immediate applicability in various settings, from artistic generation to automated content creation. The approach’s balance of improving both compositional accuracy and visual appeal makes it promising for commercial deployment, especially in creative industries where quality and detail are paramount.

Future Developments in AI

Looking ahead, this research paves the way for further inference-time enhancements of T2I models. Improving the robustness and generalization of the reward models themselves could further amplify the benefits observed with ReNO. Additionally, integrating safety and fairness objectives into the reward models could address ethical considerations, helping ensure responsible use and deployment of advanced generative models.

In conclusion, ReNO represents a significant step forward in optimizing Text-to-Image models, offering an efficient and scalable way to enhance image quality and prompt adherence without extensive computational demands. This work not only demonstrates immediate benefits but also sets the stage for future innovations in AI-driven image generation.
