ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Abstract

We present a comprehensive solution to learn and improve text-to-image models from human preference feedback. To begin with, we build ImageReward -- the first general-purpose text-to-image human preference reward model -- to effectively encode human preferences. Its training is based on our systematic annotation pipeline including rating and ranking, which collects 137k expert comparisons to date. In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. On top of it, we propose Reward Feedback Learning (ReFL), a direct tuning algorithm to optimize diffusion models against a scorer. Both automatic and human evaluation support ReFL's advantages over compared methods. All code and datasets are provided at https://github.com/THUDM/ImageReward.

Overview

  • ImageReward is a reward model developed for aligning text-to-image generative models with human preferences, using a large dataset of expert comparisons.

  • The model is trained through a rigorous annotation process that includes rating and ranking, as well as considerations of alignment, fidelity, and harmlessness.

  • ImageReward has been shown to outperform existing models such as CLIP, Aesthetic, and BLIP by large margins in terms of aligning with human preference.

  • ReFL (Reward Feedback Learning) is a new technique to fine-tune diffusion generative models using feedback from a reward scorer, proving more effective than alternatives such as data augmentation and loss reweighting.

  • Despite potential limitations, such as the limited diversity of the annotation data and the reliance on a single reward model, the authors argue that the benefits of ImageReward and ReFL, including reduced dependence on problematic training data and better conformance to social norms, outweigh the drawbacks.

Introduction

Recent years have witnessed a significant surge in the capabilities of text-to-image generative models. These models have become adept at creating images that are both high-fidelity and semantically related to the corresponding text prompts. However, a primary challenge for these systems is to align model outputs with human preferences, as the training distribution often does not reflect the true distribution of user-generated prompts.

ImageReward

Addressing the need for enhanced alignment with human preference, ImageReward emerges as a pioneering general-purpose reward model for text-to-image synthesis. It encodes human preferences effectively, trained on a substantial dataset of 137k expert comparisons. The training benefits from a meticulously crafted annotation pipeline that encompasses rating and ranking. The annotation process involves prompt categorization, problem identification, and multi-dimensional scoring based on alignment, fidelity, and harmlessness. This endeavor required months of effort in establishing labeling criteria, training annotators, and ensuring the reliability of responses.
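To make the training objective concrete, the sketch below shows a pairwise ranking loss of the kind used for preference reward models, which matches the paper's setup at a high level: for each annotated pair, the reward of the preferred image is pushed above that of the rejected one. The `backbone` stand-in (ImageReward itself builds on a BLIP-based encoder), the MLP head sizes, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a pairwise ranking objective for a preference reward
# model (the real ImageReward builds on a BLIP backbone; `backbone` here is a
# stand-in for any joint image-text encoder).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceRewardModel(nn.Module):
    def __init__(self, backbone, feat_dim=768):
        super().__init__()
        self.backbone = backbone                  # maps (prompt, image) -> joint feature
        self.head = nn.Sequential(                # small MLP producing a scalar reward
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, prompt_tokens, image):
        feat = self.backbone(prompt_tokens, image)   # [batch, feat_dim]
        return self.head(feat).squeeze(-1)           # [batch] scalar rewards

def ranking_loss(model, prompt_tokens, preferred_img, rejected_img):
    """Push r(prompt, preferred) above r(prompt, rejected) for each annotated pair."""
    r_pos = model(prompt_tokens, preferred_img)
    r_neg = model(prompt_tokens, rejected_img)
    # -log sigmoid(r_pos - r_neg): the standard pairwise preference loss
    return -F.logsigmoid(r_pos - r_neg).mean()
```

In practice, an annotator's ranking of several images for the same prompt can be expanded into its constituent preferred/rejected pairs before applying this loss.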

Evaluating ImageReward's Efficacy

ImageReward demonstrates superiority over existing scoring methods such as CLIP, Aesthetic, and BLIP. It outperforms these models by significant margins (38.6% over CLIP, 39.6% over Aesthetic, and 31.6% over BLIP) in capturing human preference for synthesized images. This close alignment with human preferences is further validated through extensive analysis and experiments. Additionally, ImageReward shows notable potential as an automatic evaluation metric for text-to-image generation tasks.
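Used as a metric, the released ImageReward package exposes a simple scoring interface. The snippet below follows the repository's documented usage in broad strokes, but the model identifier and method names should be checked against the current README; the prompt and file names are placeholders.

```python
# Sketch of scoring and ranking candidate generations for a prompt with the
# released ImageReward package; consult https://github.com/THUDM/ImageReward
# for the up-to-date interface (the call names below may have changed).
import ImageReward as RM

model = RM.load("ImageReward-v1.0")          # pretrained reward model checkpoint

prompt = "a watercolor painting of a lighthouse at sunset"            # placeholder prompt
images = ["candidate_1.png", "candidate_2.png", "candidate_3.png"]    # placeholder files

# Higher reward means closer to human preference for this prompt.
rewards = [model.score(prompt, img) for img in images]
best_reward, best_image = max(zip(rewards, images))
print(rewards, best_image)
```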

ReFL: Reward Feedback Learning

ReFL, or Reward Feedback Learning, is introduced as a method to directly fine-tune diffusion generative models using a reward scorer's feedback. The approach leverages the insight that image quality becomes assessable at late denoising steps of the generative process, so the reward can be evaluated on a one-step prediction of the final image rather than after full generation. Empirical evaluations confirm ReFL's advantages over alternative approaches such as data augmentation and loss reweighting.
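A minimal sketch of one ReFL update is given below, assuming a Stable-Diffusion-style pipeline. All `pipe.*` helpers (prompt encoding, scheduler stepping, one-step x0 prediction, VAE decoding) are placeholders for the corresponding diffusion components rather than a real API, and the pre-training regularization loss that the paper mixes in for stability is omitted for brevity.

```python
# Simplified sketch of a single ReFL update (placeholder components; the actual
# implementation in the ImageReward repository differs in details).
import random
import torch

def refl_step(pipe, reward_model, prompt, optimizer,
              total_steps=40, late_range=(30, 39), reward_weight=1e-3):
    """`pipe` bundles text encoder, UNet, scheduler, and VAE; `optimizer` updates
    the UNet parameters; `reward_model` (e.g., ImageReward) stays frozen."""
    text_emb = pipe.encode_prompt(prompt)                        # placeholder helper
    latents = torch.randn(pipe.latent_shape, device=pipe.device)

    # Pick a random late denoising step: rewards computed there already track
    # the quality of the final image well enough to provide a training signal.
    t_stop = random.randint(*late_range)
    timesteps = pipe.scheduler_timesteps(total_steps)

    # Denoise from T down to t_stop without tracking gradients ...
    with torch.no_grad():
        for t in timesteps[:t_stop]:
            noise_pred = pipe.unet(latents, t, text_emb)
            latents = pipe.scheduler_step(noise_pred, t, latents)

    # ... then take one step with gradients, predict x0 directly, decode it,
    # and score it with the frozen reward model.
    t = timesteps[t_stop]
    noise_pred = pipe.unet(latents, t, text_emb)
    pred_x0 = pipe.predict_x0(noise_pred, t, latents)            # one-step x0 estimate
    image = pipe.decode(pred_x0)                                 # VAE decode to pixels

    reward = reward_model(prompt, image)
    loss = reward_weight * (-reward.mean())    # gradient ascent on the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because gradients flow only through the final denoising step and the decoder, each update stays tractable while still steering the generator toward higher-reward outputs.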

Conclusion and Broader Impact

ImageReward and ReFL collectively represent a significant stride toward aligning generative models with human values and preferences. There are acknowledged limitations: the scale and diversity of the annotation data are finite, and a single reward model may not capture the full multiplicity of human aesthetics. Nevertheless, the authors argue that the advantages, such as mitigating over-reliance on training data with copyright issues and better conformance to social norms, significantly outweigh the downsides.
