Emergent Mind

Abstract

Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from "reward hacking" and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often as the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-$\alpha$, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time. Code is available at https://github.com/ExplainableML/ReNO.

Comparison of four Text-to-Image models with and without ReNO for different prompts.

Overview

  • The paper introduces ReNO, a novel approach for optimizing Text-to-Image (T2I) models by using human preference reward models to optimize initial noise.

  • ReNO improves image quality and prompt adherence without changing model parameters, leveraging reward models like ImageReward, PickScore, HPSv2, and CLIPScore.

  • Empirical results show that ReNO substantially improves performance on the T2I-CompBench and GenEval benchmarks, demonstrating practical applicability and motivating further work, including the integration of safety and fairness objectives into reward models.

Enhancing Text-to-Image Models through Reward-based Noise Optimization

The paper "ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization" introduces a novel approach to improving Text-to-Image (T2I) models by optimizing the initial noise based on human preference reward models. While T2I models have made impressive progress, they still struggle to capture intricate details in complex compositional prompts. Reward-based Noise Optimization (ReNO) addresses these challenges, delivering substantial improvements without altering model parameters. The method combines four reward models—ImageReward, PickScore, HPSv2, and CLIPScore—leveraging their complementary strengths to robustly guide T2I models toward high-quality, prompt-aligned images.
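To illustrate the kind of differentiable signal such reward models provide, here is a simplified, hypothetical CLIPScore-style reward: the scaled cosine similarity between an image embedding and a text embedding. The function name and `scale` parameter are illustrative assumptions; the actual CLIPScore metric and learned preference models such as ImageReward are more involved.

```python
import torch
import torch.nn.functional as F

def clip_style_reward(image_emb: torch.Tensor, text_emb: torch.Tensor,
                      scale: float = 100.0) -> torch.Tensor:
    """Toy CLIPScore-like reward: scaled cosine similarity of embeddings.

    `image_emb` and `text_emb` are assumed to be (batch, dim) outputs of an
    image encoder and a text encoder; this is a simplification, not the
    exact CLIPScore formula.
    """
    image_emb = F.normalize(image_emb, dim=-1)  # project onto the unit sphere
    text_emb = F.normalize(text_emb, dim=-1)
    return scale * (image_emb * text_emb).sum(dim=-1)  # scaled cosine similarity
```

Because such a reward is differentiable with respect to the image, its gradient can flow back through the generator to the initial noise, which is the property ReNO exploits.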

Contribution and Methodology

The primary contribution is the introduction of ReNO, which optimizes the initial noise vector at inference time to improve image quality and prompt adherence without significant computational overhead. The authors critically assess reward-based fine-tuning approaches, highlighting limitations such as reward hacking and high computational cost; ReNO sidesteps these issues by optimizing only the initial noise, leaving the model weights untouched.

The methodology involves:

  1. One-step Diffusion Models: Utilizing well-distilled one-step T2I models to maintain computational efficiency.
  2. Reward-Based Noise Optimization: Iteratively refining the initial noise vector through gradient ascent, leveraging signals from multiple reward models to improve image generation.
  3. Noise Regularization: Ensuring that the optimized noise remains within a reasonable distribution to prevent collapse and preserve semantic integrity.
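The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` stands in for a one-step T2I model, `rewards` and `weights` for the reward ensemble, and the variance penalty is one simple way to keep the noise in-distribution (the paper's regularizer may differ). Adam is used here for stable ascent steps, an implementation choice.

```python
import torch

def reno_optimize(generate, rewards, weights, prompt,
                  n_iters=50, lr=0.05, reg_strength=0.01,
                  noise_shape=(1, 4, 64, 64)):
    """Sketch of reward-based noise optimization (hypothetical, not the authors' code)."""
    noise = torch.randn(noise_shape, requires_grad=True)  # initial noise, the only optimized variable
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(n_iters):
        image = generate(noise, prompt)  # single forward pass of a one-step model
        # Weighted combination of reward signals (e.g. ImageReward, PickScore, ...)
        reward = sum(w * r(image, prompt) for w, r in zip(weights, rewards))
        # Penalize drift of the noise variance away from 1 to stay near N(0, I)
        reg = (noise.pow(2).mean() - 1.0).pow(2)
        loss = -reward + reg_strength * reg  # minimizing loss = gradient ascent on reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```

The key design choice is that only `noise` carries gradients: the generator and reward models are frozen, so the optimization adds no training cost and generalizes to any prompt at inference time.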

Numerical Results and Implications

The results on the T2I-CompBench and GenEval benchmarks show substantial improvements. For instance, applying ReNO to SD-Turbo yields gains of more than 20 percentage points in categories such as Color and Texture, bringing performance close to, or beyond, proprietary models such as DALL-E 3. Within a computational budget of 20-50 seconds per image, ReNO remains practical, making it an efficient solution even for high-demand scenarios.

Furthermore, user studies on Parti-Prompts confirm that ReNO-enhanced models are preferred for both image aesthetics and prompt faithfulness, suggesting a balanced enhancement that addresses both quantitative metrics and subjective user preferences.

Theoretical and Practical Implications

The research elucidates the critical role of initial noise in T2I models and presents a robust framework for leveraging reward models. Theoretically, it raises important questions about the distribution and manipulation of noise in generative models, opening avenues for further exploration of inference-time optimization.

Practically, ReNO's efficiency and effectiveness suggest immediate applicability in various settings, from artistic generation to automated content creation. The approach’s balance of improving both compositional accuracy and visual appeal makes it promising for commercial deployment, especially in creative industries where quality and detail are paramount.

Future Developments in AI

Looking ahead, this research paves the way for further inference-time enhancements of T2I models. Improving the robustness and generalization of the reward models themselves could further amplify the benefits observed with ReNO. Additionally, integrating safety and fairness objectives into the reward models could address ethical considerations, helping ensure responsible use and deployment of advanced generative models.

In conclusion, ReNO represents a significant step forward in optimizing Text-to-Image models, offering an efficient and scalable way to enhance image quality and prompt adherence without extensive computational demands. This work not only demonstrates immediate benefits but also sets the stage for future innovations in AI-driven image generation.
