Rich Human Feedback for Text-to-Image Generation (2312.10240v2)
Abstract: Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for LLMs, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing from the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks from the predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which the human feedback data were collected (Stable Diffusion variants). The RichHF-18K dataset will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.
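As a rough illustration of the two uses mentioned above (the paper does not publish reference code, so the helper names, the 0.5 threshold, and the dilation width below are assumptions), a predicted artifact/misalignment heatmap can be binarized and dilated into an inpainting mask, and predicted quality scores can be used to filter finetuning data:

```python
# Minimal sketch, not the authors' implementation: hypothetical helpers with
# illustrative hyperparameters (threshold, dilation width, keep fraction).
import numpy as np
from scipy import ndimage


def heatmap_to_inpainting_mask(heatmap: np.ndarray,
                               threshold: float = 0.5,
                               dilation_iters: int = 8) -> np.ndarray:
    """Turn a predicted (H, W) heatmap with values in [0, 1] into a binary
    mask covering the regions flagged as implausible or misaligned."""
    mask = heatmap >= threshold  # flag pixels the reward model marks as problematic
    # Dilate so the inpainting region fully covers each artifact's boundary.
    mask = ndimage.binary_dilation(mask, iterations=dilation_iters)
    return mask.astype(np.uint8)  # 1 = region to inpaint


def select_finetuning_indices(scores: np.ndarray,
                              keep_fraction: float = 0.2) -> np.ndarray:
    """Keep the indices of the top-scoring fraction of generated images
    to use as finetuning data for the generative model."""
    k = max(1, int(len(scores) * keep_fraction))
    return np.argsort(scores)[::-1][:k]  # indices sorted best-first
```

In this sketch, the mask would then be passed, together with the original prompt, to an off-the-shelf inpainting model, and the selected indices used to subsample the finetuning set.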
- Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Universal guidance for diffusion models, 2023.
- Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
- Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
- Pali: A jointly-scaled multilingual language-image model, 2022.
- Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation. arXiv preprint arXiv:2310.18235, 2023a.
- Visual programming for text-to-image generation and evaluation. In NeurIPS, 2023b.
- Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. In Advances in Neural Information Processing Systems, 2023.
- Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- Matryoshka diffusion models, 2023.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Clipscore: A reference-free evaluation metric for image captioning. In EMNLP, 2021.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2017.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In ICCV, 2023.
- T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023.
- A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
- Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
- Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
- Holistic evaluation of text-to-image models. In Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- Controllable text-to-image generation. Advances in Neural Information Processing Systems, 32, 2019.
- Spotlight: Mobile ui understanding using vision-language models with a focus. In International Conference on Learning Representations, 2023.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- Conditional image-to-video generation with latent flow diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18444–18455, 2023.
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- Toward verifiable and reproducible human evaluation for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14277–14286, 2023.
- Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189, 2022.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
- Emu edit: Precise image editing via recognition and generation tasks, 2023.
- Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations, Workshop Track, 2014.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16515–16525, 2022.
- Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In CVPR, 2023.
- Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023a.
- Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2096–2105, 2023b.
- A survey on video diffusion models. arXiv preprint arXiv:2310.10647, 2023.
- Imagereward: Learning and evaluating human preferences for text-to-image generation. In Neural Information Processing Systems, 2023.
- Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
- Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2022.
- What you see is what you read? improving text-image alignment evaluation, 2023.
- Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
- Text-to-image diffusion models in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023a.
- Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
- Perceptual artifacts localization for image synthesis tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7579–7590, 2023b.
- Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5802–5810, 2019.