Abstract

Despite the stunning ability of recent text-to-image models to generate high-quality images, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions). We further propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation and explore the potential and limitations of multimodal LLMs for evaluation. We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. Extensive experiments and evaluations are conducted to benchmark previous methods on T2I-CompBench, and to validate the effectiveness of our proposed evaluation metrics and GORS approach. Project page is available at https://karine-h.github.io/T2I-CompBench/.

Overview

  • The paper introduces T2I-CompBench, a new benchmark for evaluating open-world compositional text-to-image generation.

  • A set of distinct evaluation metrics tailored for different compositional aspects is presented, aiming to better align with human perceptual judgment.

  • Generative mOdel fine-tuning with Reward-driven Sample selection (GORS) is proposed to enhance the compositional ability of text-to-image generation models.

  • The study recognizes limitations such as the absence of a unified evaluation metric and the need for caution against potential biases in generated content.

Introduction

Text-to-image (T2I) generation has experienced significant advancements through state-of-the-art models like Stable Diffusion and techniques involving generative adversarial networks and transformer architectures. Nevertheless, challenges remain in effectively composing objects with different attributes and relationships within complex scenes. Recognizing this limitation, the paper under discussion introduces T2I-CompBench, a substantial benchmark dedicated to open-world compositional T2I generation.

Benchmark and Evaluation Metrics

T2I-CompBench comprises 6,000 prompts spanning three main categories: attribute binding (with color, shape, and texture sub-categories), object relationships (spatial and non-spatial), and complex compositions. The work underscores the insufficiency of existing evaluation metrics for these settings. To address this gap, the authors present distinct metrics for each compositional category, such as disentangled BLIP-VQA for attribute binding and UniDet-based detection for spatial relationships. Additionally, a unified 3-in-1 metric is introduced for complex prompts, combining the best-performing metric from each sub-category. These metrics have been empirically validated to align closely with human perception.
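Below is a minimal sketch of the disentangled-questioning idea behind the attribute-binding metric. It assumes a hypothetical `vqa_yes_prob` helper that wraps a BLIP-style VQA model; the question template, phrase decomposition, and product aggregation are illustrative choices, not the paper's exact implementation.

```python
from typing import Callable, List

def attribute_binding_score(
    image,
    noun_phrases: List[str],
    vqa_yes_prob: Callable[[object, str], float],
) -> float:
    """Ask one yes/no question per attribute-object phrase and combine the answers."""
    # e.g. "a green bench and a red car" -> ["a green bench", "a red car"]
    probs = [vqa_yes_prob(image, f"Is there {phrase}?") for phrase in noun_phrases]
    score = 1.0
    for p in probs:
        score *= p  # the image scores highly only if every binding is verified
    return score
```

Querying each attribute-object pair separately is what makes the metric "disentangled": a single caption-level similarity score can be fooled when attributes leak across objects, whereas per-phrase questions are not.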

Generative Model Fine-tuning with GORS

The paper proposes Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), a new method tailored to enhance compositional T2I generation. GORS fine-tunes a pretrained text-to-image model on its own generated images that align well with the compositional prompts, weighting the fine-tuning loss by each selected sample's alignment reward so that compositionally faithful samples contribute more. Empirical results demonstrate the effectiveness of GORS, which not only outperforms existing methods quantitatively but also aligns better with human judgments.
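As a rough illustration of reward-driven selection with a reward-weighted loss, the sketch below assumes placeholder `reward_fn` (e.g. one of the compositional metrics above), `diffusion_loss`, and `threshold` arguments; the actual GORS fine-tuning recipe (which parameters are updated, how the threshold is chosen) is not reproduced here.

```python
import torch

def gors_finetune_step(model, optimizer, prompts, images,
                       reward_fn, diffusion_loss, threshold=0.5):
    """One reward-weighted fine-tuning step on the model's own generated samples."""
    rewards = torch.tensor([reward_fn(img, p) for img, p in zip(images, prompts)])
    keep = rewards > threshold  # reward-driven sample selection
    if not keep.any():
        return None  # no sample is well-aligned enough to learn from
    losses = torch.stack([
        diffusion_loss(model, img, p)
        for img, p, k in zip(images, prompts, keep) if k
    ])
    loss = (rewards[keep].to(losses.device) * losses).mean()  # better-aligned samples weigh more
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this framing the model never needs explicit negative examples: it simply re-learns from its own best outputs, with the compositional metrics serving as the reward signal.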

Discussion and Conclusion

The study's central contributions are the T2I-CompBench benchmark and the newly developed evaluation metrics, which promise a more faithful assessment of compositional text-to-image generation. The GORS method sets a new standard for improving compositional capabilities in T2I models, with quantitative and qualitative results reinforcing its efficacy. Limitations include the lack of a single unified evaluation metric across all composition types and the need to carefully consider possible biases and negative impacts of model-generated content. The paper points to future work, especially the development of a unified metric that leverages the reasoning capabilities of multimodal LLMs.
