DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

(2305.16381)
Published May 25, 2023 in cs.LG and cs.CV

Abstract

Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Even though relatively simple approaches (e.g., rejection sampling based on reward scores) have been investigated, fine-tuning text-to-image models with the reward function remains challenging. In this work, we propose using online reinforcement learning (RL) to fine-tune text-to-image models. We focus on diffusion models, defining the fine-tuning task as an RL problem, and updating the pre-trained text-to-image diffusion models using policy gradient to maximize the feedback-trained reward. Our approach, coined DPOK, integrates policy optimization with KL regularization. We conduct an analysis of KL regularization for both RL fine-tuning and supervised fine-tuning. In our experiments, we show that DPOK is generally superior to supervised fine-tuning with respect to both image-text alignment and image quality. Our code is available at https://github.com/google-research/google-research/tree/master/dpok.

Overview

  • The paper introduces DPOK, a method to fine-tune text-to-image diffusion models using online reinforcement learning and KL regularization.

  • DPOK aims to align generated images more closely with human preferences by maximizing a reward model trained on human feedback.

  • Experimental results show that DPOK outperforms supervised fine-tuning in text-to-image alignment and image quality.

  • DPOK can mitigate biases inherited from the pretrained model, as shown by its ability to override spurious prompt associations learned from web data.

  • The paper suggests further research on online RL fine-tuning for improving model reliability and responding to complex prompts.

Introduction

Diffusion models have significantly advanced AI-driven text-to-image generation, turning textual descriptions into striking visual content. Despite this progress, even prominent models such as Imagen, DALL-E 2, and Stable Diffusion struggle with prompts that impose precise specifications, such as exact object counts or specific colors. In reinforcement learning (RL), a promising direction has been to use human feedback to refine models for better alignment with human preferences. This paper presents DPOK (Diffusion Policy Optimization with KL regularization), an approach that leverages online RL to fine-tune text-to-image diffusion models.

Methodology

DPOK applies online reinforcement learning to optimize the expected reward of generated images, aligning them with human evaluations. The method maximizes a reward model trained on human feedback while using KL divergence against the pretrained model as regularization, so that the fine-tuned model does not drift too far from the pretrained model's capabilities. The authors also provide theoretical analyses comparing KL regularization in the online RL and supervised fine-tuning settings. A key distinction is that online RL evaluates the reward and the conditional KL divergence on samples from the current model's own distribution rather than only on a fixed supervised dataset, which the authors argue gives it an edge over traditional supervised fine-tuning.
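
To make this concrete, here is a minimal PyTorch sketch of a DPOK-style policy-gradient loop with per-step KL regularization, using a tiny 2-D Gaussian "denoiser" in place of a real text-to-image diffusion model. The `TinyDenoiser`, the quadratic stand-in `reward_fn`, and all hyperparameters are illustrative assumptions, not the paper's architecture or released code. Only the structure mirrors the method: a reward-weighted trajectory log-likelihood combined with a KL penalty toward a frozen copy of the pretrained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, DIM, KL_WEIGHT, LR = 10, 2, 0.01, 1e-3

class TinyDenoiser(nn.Module):
    """Predicts the mean of p(x_{t-1} | x_t) for a 2-D toy 'latent'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + 1, 32), nn.Tanh(), nn.Linear(32, DIM))

    def forward(self, x, t):
        t_feat = torch.full((x.shape[0], 1), float(t) / T)
        return self.net(torch.cat([x, t_feat], dim=-1))

def reward_fn(x0):
    # Stand-in for a learned human-feedback reward model (e.g. ImageReward).
    return -(x0 - torch.tensor([[1.0, -1.0]])).pow(2).sum(dim=-1)

policy = TinyDenoiser()                        # model being fine-tuned
pretrained = TinyDenoiser()                    # frozen reference model
pretrained.load_state_dict(policy.state_dict())
for p in pretrained.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=LR)
sigma = 0.1                                    # fixed per-step noise scale

for step in range(200):
    x = torch.randn(8, DIM)                    # start each trajectory from pure noise
    log_probs, kls = [], []
    for t in range(T, 0, -1):                  # reverse-time denoising
        mean = policy(x, t)
        with torch.no_grad():
            mean_pre = pretrained(x, t)
        x_next = (mean + sigma * torch.randn_like(mean)).detach()
        # Gaussian log-density (up to a constant) of the sampled transition
        # under the current policy.
        log_probs.append(-((x_next - mean) ** 2).sum(-1) / (2 * sigma ** 2))
        # KL between two Gaussians with shared variance reduces to a scaled
        # squared difference of their means.
        kls.append(((mean - mean_pre) ** 2).sum(-1) / (2 * sigma ** 2))
        x = x_next
    reward = reward_fn(x)
    # REINFORCE-style loss: reward-weighted trajectory log-likelihood plus a
    # per-step KL penalty that keeps the policy close to the pretrained model.
    loss = (-(reward.detach() * torch.stack(log_probs).sum(0))
            + KL_WEIGHT * torch.stack(kls).sum(0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("mean reward after fine-tuning:", reward.mean().item())
```

In the paper, the policy is the pretrained diffusion model's denoising network conditioned on the text prompt and the reward comes from a model such as ImageReward; the released implementation also handles the memory and gradient-variance issues that this toy ignores.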

Experimental Results

DPOK is empirically tested by fine-tuning Stable Diffusion with ImageReward as the reward model, focusing on text-to-image alignment while retaining high image fidelity. Results show that DPOK generally outperforms supervised fine-tuning on both counts: online RL fine-tuning yields stronger text-image alignment, reflected in higher ImageReward scores, while preserving image quality, as evidenced by higher aesthetic scores. Human evaluations also consistently favor the RL-fine-tuned model over the supervised one on both image-text alignment and image quality. A notable contribution is that DPOK can mitigate biases inherited from the pretrained model, for example overriding the web-learned association of the prompt "Four roses" with the whiskey brand rather than the flowers.
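
For context on the alignment metric, the snippet below sketches how generated samples might be scored with the open-source ImageReward package. The `load`/`score` calls are meant to follow the package's documented usage, but treat the exact interface as an assumption; the prompt and file paths are illustrative placeholders, not the paper's evaluation pipeline.

```python
# Sketch of scoring generated samples with ImageReward (pip install image-reward).
# Paths and the prompt are hypothetical; only the load()/score() calls are meant
# to reflect the package's documented interface.
import ImageReward as RM

model = RM.load("ImageReward-v1.0")          # fetches the reward checkpoint
prompt = "A green colored rabbit"            # one of the paper's evaluation prompts
images = ["sample_0.png", "sample_1.png"]    # hypothetical generated images

scores = model.score(prompt, images)         # alignment score(s) for the images
print(scores)
```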

Conclusion

DPOK marks a significant step forward in enhancing text-to-image diffusion models through online RL fine-tuning. The technique shows a substantial improvement over supervised fine-tuning, optimizing image-text alignment while maintaining or even improving the aesthetic quality of the generated images. It sets the stage for further work on efficient online RL fine-tuning that could enable models to reliably generate highly complex and varied images while staying attuned to human judgment. The paper acknowledges potential limitations and calls for future research on the efficiency and adaptability of fine-tuning with diverse prompts. It also points to the broader impacts of this work, emphasizing the need to thoroughly understand reward models, since they now exert greater influence over the fine-tuning process.
