
Abstract

Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However, in the vision domain, existing RL-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts whereas RL-based methods completely fail.

PRDP is a framework for stabilizing policy-gradient reward finetuning by converting the RLHF objective into a supervised regression problem.

Overview

  • The paper introduces Proximal Reward Difference Prediction (PRDP) to stabilize and enhance the reward finetuning of diffusion models using large-scale datasets.

  • PRDP proposes a supervised regression task to predict reward differences between pairs of images, improving training stability and generalization compared to traditional RL-based approaches.

  • Experimental results show PRDP's superior performance in both small-scale and large-scale finetuning scenarios, significantly advancing the state of reward finetuning in vision-based generative models.

Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

The paper proposes Proximal Reward Difference Prediction (PRDP), a method designed to enable stable black-box reward finetuning of diffusion models on large-scale prompt datasets. It addresses the limitations of existing RL-based reward finetuning methods in the vision domain, particularly their instability in large-scale training, which compromises their ability to generalize to complex, unseen prompts.

Background

Diffusion models have demonstrated significant success in generative modeling of continuous data, including photorealistic text-to-image synthesis. However, their maximum likelihood training objective often misaligns with downstream requirements like generating novel object compositions and aesthetically preferred images. In the language domain, Reinforcement Learning from Human Feedback (RLHF) has been adopted to align language models with human preferences, showing notable success. Inspired by this, analogous reward models have been developed for the vision domain, such as HPSv2 and PickScore. Yet, applying RL-based finetuning methods to diffusion models, as attempted by methods like DDPO, reveals inherent instability in large-scale setups.

Proximal Reward Difference Prediction (PRDP)

PRDP addresses the instability of RL-based approaches by introducing a supervised regression objective for finetuning diffusion models, named Reward Difference Prediction (RDP). The innovation centers on a regression task in which the diffusion model predicts the reward difference between pairs of generated images from their denoising trajectories. The paper theoretically establishes that a diffusion model achieving perfect reward difference prediction is exactly the maximizer of the RL objective.
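To make the regression objective concrete, below is a minimal PyTorch-style sketch of the idea described above. It assumes a parameterization in which the predicted reward difference is a scaled gap between the two trajectories' log-likelihood ratios against a frozen reference model; the argument names, tensor shapes, and the coefficient `beta` are illustrative assumptions, not the paper's exact formulation or API.

```python
import torch

def rdp_loss(logp_theta_a, logp_ref_a, logp_theta_b, logp_ref_b,
             reward_a, reward_b, beta=1.0):
    """Reward Difference Prediction as supervised regression (sketch).

    logp_theta_* / logp_ref_*: per-step log-probabilities of the two
    denoising trajectories under the finetuned and frozen reference
    models, shape (batch, num_denoising_steps).
    reward_a, reward_b: black-box rewards of the two final images, shape (batch,).
    beta: assumed scaling coefficient on the log-likelihood ratio.
    """
    # Predicted reward difference: gap between the trajectories'
    # log-likelihood ratios relative to the reference model.
    pred_diff = beta * ((logp_theta_a - logp_ref_a).sum(dim=-1)
                        - (logp_theta_b - logp_ref_b).sum(dim=-1))
    # Ground-truth reward difference from the black-box reward model.
    target_diff = reward_a - reward_b
    # Plain MSE regression; per the paper's analysis, perfect prediction
    # coincides with the maximizer of the RL objective.
    return torch.mean((pred_diff - target_diff) ** 2)
```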

Key Contributions

  • RDP Objective: The RDP objective is designed to inherit the optimal solution of the RL objective while providing enhanced training stability. This is formulated as a supervised learning task, predicting reward differences between image pairs generated from text prompts.
  • Proximal Updates: To mitigate training instability, the authors propose proximal updates inspired by Proximal Policy Optimization (PPO), clipping the log-probability ratios to keep each optimization step stable and bounded.
  • Online Optimization Algorithm: To further improve stability and performance, the authors employ an online optimization strategy in which the diffusion model is updated iteratively on freshly sampled generations, avoiding the pitfalls of a static dataset (both ideas are sketched in the code after this list).
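The sketch below illustrates how PPO-style clipping and online sampling could fit around such a regression loss. It is a hedged approximation under stated assumptions: the helper callables (`sample_prompts`, `generate_pair`, `reward_fn`, `rdp_loss_fn`), the clipping range `eps`, and the number of inner steps are hypothetical placeholders, not the authors' implementation.

```python
import math
import torch

def clipped_log_ratio_sum(logp_theta, logp_old, eps=0.1):
    """Proximal-update sketch: clip each per-step log-probability ratio so
    the implied probability ratio stays within [1 - eps, 1 + eps], in the
    spirit of PPO clipping (illustrative, not the paper's exact rule)."""
    log_ratio = logp_theta - logp_old
    lo, hi = math.log(1.0 - eps), math.log(1.0 + eps)
    return torch.clamp(log_ratio, lo, hi).sum(dim=-1)

def online_finetune_step(model, optimizer, sample_prompts, generate_pair,
                         reward_fn, rdp_loss_fn, inner_steps=4):
    """One round of online optimization: sample fresh image pairs with the
    current model, score them with the black-box reward, then take a few
    proximal regression steps on the reward-difference objective."""
    prompts = sample_prompts()                      # draw a fresh prompt batch
    traj_a, traj_b = generate_pair(model, prompts)  # roll out two denoising trajectories
    r_a, r_b = reward_fn(traj_a, prompts), reward_fn(traj_b, prompts)
    for _ in range(inner_steps):
        loss = rdp_loss_fn(model, traj_a, traj_b, r_a, r_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.detach()
```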

Experimental Validation

PRDP is evaluated through a series of experiments:

  1. Small-Scale Finetuning: The method is tested on a dataset of 45 prompts with HPSv2 and PickScore as reward models. PRDP matches, and in some cases slightly exceeds, the performance of DDPO, demonstrating its efficacy in small-scale settings.
  2. Large-Scale Finetuning: PRDP scales to over 100K prompts from the Human Preference Dataset v2 (HPDv2) and the Pick-a-Pic v1 dataset, achieving superior generation quality on previously unseen prompts and maintaining stability where DDPO fails.
  3. Multi-Reward Finetuning: Comprehensive training on mixed rewards showcases PRDP's superiority in generating higher quality images under complex and diverse prompt sets.

Implications and Future Work

The findings of this research have substantial implications for the practical application and theoretical understanding of diffusion models in generative tasks. PRDP's robustness and stability in large-scale finetuning contexts suggest broad applicability in domains requiring high-quality, diverse image generation. Additionally, the integration of supervised learning concepts to approach traditionally RL-driven objectives may inspire further novel strategies across different areas of AI model finetuning.

Future developments may explore further optimization techniques or hybrid approaches that blend the benefits of supervised learning stability with refined RL techniques. Researchers might also investigate extending PRDP to other model architectures and additional data modalities, thereby expanding its utility and impact.

Conclusion

PRDP represents a significant step forward in stable large-scale reward finetuning of diffusion models, offering a practical and theoretically sound alternative to RL-based methods. By converting the RLHF objective into a supervised learning task and incorporating proximal updates, PRDP provides a scalable, stable solution for enhancing diffusion models under broad and complex generative tasks.
