Abstract

Despite the stunning ability of recent text-to-image models to generate high-quality images, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation and explore the potential and limitations of multimodal LLMs for evaluation. We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. Extensive experiments and evaluations are conducted to benchmark previous methods on T2I-CompBench, and to validate the effectiveness of our proposed evaluation metrics and GORS approach. Project page is available at https://karine-h.github.io/T2I-CompBench/.

Overview

  • The paper introduces T2I-CompBench, a new benchmark for evaluating open-world compositional text-to-image generation.

  • A set of distinct evaluation metrics tailored for different compositional aspects is presented, aiming to better align with human perceptual judgment.

  • Generative mOdel fine-tuning with Reward-driven Sample selection (GORS) is proposed to enhance the compositional ability of text-to-image generation models.

  • The study recognizes limitations such as the absence of a unified evaluation metric and the need for caution against potential biases in generated content.

Introduction

Text-to-image (T2I) generation has experienced significant advancements through state-of-the-art models like Stable Diffusion and techniques involving generative adversarial networks and transformer architectures. Nevertheless, challenges remain in effectively composing objects with different attributes and relationships within complex scenes. Recognizing this limitation, the paper under discussion introduces T2I-CompBench, a substantial benchmark dedicated to open-world compositional T2I generation.

Benchmark and Evaluation Metrics

T2I-CompBench comprises 6,000 prompts spanning three main categories: attribute binding (with color, shape, and texture sub-categories), object relationships (spatial and non-spatial), and complex compositions. The work underscores the insufficiency of existing evaluation metrics for these settings. To address this gap, the authors present distinct metrics for each compositional category, such as disentangled BLIP-VQA for attribute binding and UniDet-based detection for spatial relationships. Additionally, a unified 3-in-1 metric is introduced for complex prompts, combining the best-performing metric from each sub-category. These metrics have been empirically validated to align closely with human perception.
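Below is a minimal sketch of the disentangled-questioning idea behind the attribute-binding metric. It assumes a hypothetical `vqa_yes_prob` helper that wraps a BLIP-style VQA model; the question template, phrase decomposition, and product aggregation are illustrative choices, not the paper's exact implementation.

```python
from typing import Callable, List

def attribute_binding_score(
    image,
    noun_phrases: List[str],
    vqa_yes_prob: Callable[[object, str], float],
) -> float:
    """Ask one yes/no question per attribute-object phrase and combine the answers."""
    # e.g. "a green bench and a red car" -> ["a green bench", "a red car"]
    probs = [vqa_yes_prob(image, f"Is there {phrase}?") for phrase in noun_phrases]
    score = 1.0
    for p in probs:
        score *= p  # the image scores highly only if every binding is verified
    return score
```

Querying each attribute-object pair separately is what makes the metric "disentangled": a single caption-level similarity score can be fooled when attributes leak across objects, whereas per-phrase questions are not.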

Generative Model Fine-tuning with GORS

The paper proposes Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), a new method tailored to enhance compositional T2I generation. GORS fine-tunes a pretrained text-to-image model on its own generated images that align well with the compositional prompts, weighting the fine-tuning loss by each selected sample's alignment reward so that compositionally faithful samples contribute more. Empirical results demonstrate the effectiveness of GORS, which not only outperforms existing methods quantitatively but also aligns better with human judgments.
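As a rough illustration of reward-driven selection with a reward-weighted loss, the sketch below assumes placeholder `reward_fn` (e.g. one of the compositional metrics above), `diffusion_loss`, and `threshold` arguments; the actual GORS fine-tuning recipe (which parameters are updated, how the threshold is chosen) is not reproduced here.

```python
import torch

def gors_finetune_step(model, optimizer, prompts, images,
                       reward_fn, diffusion_loss, threshold=0.5):
    """One reward-weighted fine-tuning step on the model's own generated samples."""
    rewards = torch.tensor([reward_fn(img, p) for img, p in zip(images, prompts)])
    keep = rewards > threshold  # reward-driven sample selection
    if not keep.any():
        return None  # no sample is well-aligned enough to learn from
    losses = torch.stack([
        diffusion_loss(model, img, p)
        for img, p, k in zip(images, prompts, keep) if k
    ])
    loss = (rewards[keep].to(losses.device) * losses).mean()  # better-aligned samples weigh more
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this framing the model never needs explicit negative examples: it simply re-learns from its own best outputs, with the compositional metrics serving as the reward signal.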

Discussion and Conclusion

The study's central contributions are the T2I-CompBench benchmark and the newly developed evaluation metrics, which promise a more faithful assessment of compositional text-to-image generation. The GORS method sets a new standard for improving compositional capabilities in T2I models, with quantitative and qualitative results reinforcing its efficacy. Limitations include the lack of a single unified evaluation metric across all composition types and the need to carefully consider possible biases and negative impacts of model-generated content. The paper points to future work, especially the development of a unified metric that leverages the reasoning capabilities of multimodal LLMs.
