Rich Human Feedback for Text-to-Image Generation

(2312.10240)
Published Dec 15, 2023 in cs.CV

Abstract

Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for LLMs, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.

Overview

  • The paper addresses the limitations of Text-to-Image (T2I) generation models and the need for better feedback mechanisms.

  • A new method for collecting detailed human feedback has been introduced, resulting in the RichHF-18K dataset.

  • A Rich Automatic Human Feedback (RAHF) multimodal transformer has been developed to predict human feedback on images.

  • The RAHF model's predictions highly correlate with human annotations and can improve T2I model performance.

  • Researchers showcased the use of RAHF in inpainting problematic regions and finetuning generative models, enhancing image quality.

Background

The rapid progress of Text-to-Image (T2I) generation models like Stable Diffusion and Imagen has opened up new avenues in various creative domains. Despite these advancements, many generated images suffer from unrealistic artifacts, misalignment with text descriptions, and subpar aesthetic quality. Traditional evaluation metrics often fail to capture these nuances, creating a need for more refined feedback mechanisms to improve T2I generation models.

Rich Human Feedback Data Collection

The study introduces a new approach to collecting detailed human feedback. Instead of assigning a single score per image, annotators provide:

  • Image region annotations that highlight implausible areas or segments that don't match the text.
  • Text annotations that mark words in the prompt which are misrepresented or not depicted in the image.
  • Fine-grained scores for plausibility, text-image alignment, aesthetic appeal, and overall quality.

The resulting large-scale dataset, RichHF-18K, comprises rich annotations on 18,000 generated images. It enables more comprehensive evaluation and serves as training data for a model that predicts such feedback automatically, which in turn can be used to finetune and improve existing T2I models.
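To make the structure of this feedback concrete, the sketch below shows one way such an annotated example could be represented in code. The class name, field names, types, and normalization are assumptions for illustration only and do not reflect the released file format of RichHF-18K.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class RichHFExample:
    """One annotated image in a RichHF-18K-style dataset (hypothetical schema)."""

    image_path: str                    # generated image being judged
    prompt: str                        # text prompt used to generate the image
    artifact_heatmap: np.ndarray       # HxW map of implausible/artifact regions
    misalignment_heatmap: np.ndarray   # HxW map of regions contradicting the prompt
    misaligned_token_ids: List[int]    # prompt words that are missing or misrepresented
    plausibility: float                # fine-grained scores, assumed normalized to [0, 1]
    text_alignment: float
    aesthetics: float
    overall: float
```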

Predictive Model Development

Researchers developed a multimodal transformer that automatically predicts the detailed feedback humans would provide. This Rich Automatic Human Feedback (RAHF) model can do the following:

  • Predict regions of implausibility and misalignment in an image.
  • Identify mismatched or missing concepts in the text prompts.
  • Assign fine-grained scores evaluating various image quality aspects.

Testing shows that the RAHF model's predictions correlate highly with human annotations. Furthermore, these predictive capabilities can enhance image generation by guiding the selection of training data or refining the generative process.
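As a rough illustration of how a single model can emit all three kinds of feedback, here is a minimal PyTorch-style sketch with a shared multimodal encoder and one head per output. The paper's RAHF model is built from pretrained vision and text transformer components; the specific layers, dimensions, and class name below (`RAHFSketch`) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class RAHFSketch(nn.Module):
    """Simplified multi-head feedback predictor (illustrative only)."""

    def __init__(self, dim: int = 768, vocab_size: int = 32000, num_scores: int = 4):
        super().__init__()
        # Stand-in encoders; a real system would use pretrained image/text towers.
        self.image_encoder = nn.Linear(3 * 16 * 16, dim)   # patch-embedding stand-in
        self.text_encoder = nn.Embedding(vocab_size, dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # One head per feedback type.
        self.artifact_head = nn.Linear(dim, 1)        # per-patch implausibility heatmap
        self.misalign_head = nn.Linear(dim, 1)        # per-patch text-misalignment heatmap
        self.token_head = nn.Linear(dim, 2)           # per-token aligned / misaligned label
        self.score_head = nn.Linear(dim, num_scores)  # plausibility, alignment, aesthetics, overall

    def forward(self, patches: torch.Tensor, token_ids: torch.Tensor):
        img = self.image_encoder(patches)                   # (B, P, dim)
        txt = self.text_encoder(token_ids)                  # (B, T, dim)
        fused = self.fusion(torch.cat([img, txt], dim=1))   # (B, P+T, dim)
        img_h, txt_h = fused[:, : img.shape[1]], fused[:, img.shape[1]:]
        return {
            "artifact_heatmap": self.artifact_head(img_h).squeeze(-1),
            "misalignment_heatmap": self.misalign_head(img_h).squeeze(-1),
            "token_labels": self.token_head(txt_h),
            "scores": self.score_head(fused.mean(dim=1)),
        }
```

In practice, the per-patch heatmap logits would be reshaped back onto the image grid and each head trained against the corresponding human annotations.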

Applications and Improvements in Image Generation

The RAHF model has been used to improve T2I models in two ways (a code sketch follows the list):

  1. Inpainting Problematic Regions: By creating masks from the predicted heatmaps, problematic sections of images can be inpainted, leading to higher-quality results.
  2. Finetuning Generative Models: Using predicted scores, researchers can select and filter data to better train generative models like Muse.
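Below is a minimal sketch of the two usage patterns, assuming a predicted heatmap and RAHF scores are already available. The threshold values, attribute name (`predicted_overall`), and the public Stable Diffusion inpainting pipeline mentioned in the comment are illustrative stand-ins; the paper performs region inpainting with Muse and does not prescribe these exact settings.

```python
import numpy as np
from PIL import Image


def heatmap_to_mask(heatmap: np.ndarray, threshold: float = 0.5) -> Image.Image:
    """Binarize a predicted artifact heatmap into an inpainting mask.

    `threshold` is an assumed hyperparameter; the exact post-processing used in
    the paper is not reproduced here.
    """
    mask = (heatmap >= threshold).astype(np.uint8) * 255
    return Image.fromarray(mask).convert("L")


def filter_finetuning_data(examples, min_score: float = 0.8):
    """Keep only examples whose predicted overall score clears a quality bar.

    `predicted_overall` is a hypothetical attribute holding the RAHF score;
    the cutoff value is an assumption, not the paper's setting.
    """
    return [ex for ex in examples if ex.predicted_overall >= min_score]


# Illustrative inpainting with a public Stable Diffusion inpainting pipeline
# (a stand-in for the Muse-based inpainting used in the paper):
#
#   from diffusers import StableDiffusionInpaintPipeline
#   pipe = StableDiffusionInpaintPipeline.from_pretrained(
#       "runwayml/stable-diffusion-inpainting"
#   )
#   fixed = pipe(prompt=prompt, image=image, mask_image=mask).images[0]
```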

Conclusion

This paper presents the first comprehensive dataset of fine-grained human feedback for text-to-image generation and introduces a model that predicts and applies this feedback for model enhancement. Notably, the improvements generalize to generative models (Muse) beyond those used to produce the annotated images (Stable Diffusion variants), pointing toward T2I systems whose quality aligns more closely with human judgment.
