
Abstract

With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness: the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics have not been rigorously compared and benchmarked; instead, each is presented against a few weak baselines by correlation with human Likert scores over a set of easy-to-discriminate images. We introduce T2IScoreScore (TS2), a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple feature-based metrics like CLIPScore, particularly on a hard subset of naturally occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.

Figure: Text-to-image evaluation metrics assessed over images organized in a semantic error graph.

Overview

  • The paper introduces T2IScoreScore (TS2), a set of semantic error graphs (SEGs) and meta-metrics to objectively assess text-to-image (T2I) prompt faithfulness metrics.

  • The study underscores the lack of a standardized benchmark for evaluating T2I prompt coherence metrics, demonstrating the need through a broad survey of existing benchmarks.

  • T2IScoreScore adopts a distinctive structure with high image-to-prompt ratios to facilitate the construction of SEGs, enabling comprehensive metric evaluations.

  • Experiments assess various T2I prompt faithfulness metrics, revealing that simpler feature-based metrics like CLIPScore compete well with sophisticated vision-language model (VLM)-based metrics.

Who Evaluates the Evaluations? A Benchmark for Text-to-Image Prompt Coherence Metrics

Introduction

The landscape of text-to-image (T2I) models has witnessed rapid advances, propelling the fidelity and semantic coherence of generated images to unprecedented levels. Despite this progress, the challenge of aligning generated images with their text prompts, a cornerstone of T2I model evaluation, persists. The heterogeneity among the automated prompt faithfulness metrics proposed to measure this alignment underscores the pressing need for a standardized benchmark. This study introduces T2IScoreScore (TS2), a meticulously curated set of semantic error graphs (SEGs) and corresponding meta-metrics, aiming to objectively assess the efficacy of various T2I prompt faithfulness metrics.

Related Work

A broad survey of existing benchmarks reveals a disjointed landscape in which each metric employs a distinct evaluation methodology, often designed to highlight its own strengths. While ad-hoc tests against prior baselines are common, they fall short of offering a consistent or objective comparison framework. Our investigation highlights the absence of objective benchmarks that rigorously compare T2I prompt coherence metrics against clearly defined errors, rather than by correlation with subjective human judgment.

The Dataset

T2IScoreScore distinguishes itself through a structure that emphasizes high image-to-prompt ratios. This design facilitates the construction of semantic error graphs (SEGs), in which images are organized by increasing deviation from the original prompt. The dataset comprises 165 SEGs, covering a spectrum from synthetic errors to natural misinterpretations, thereby setting the stage for comprehensive metric evaluations.
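To make this structure concrete, below is a minimal sketch of how a single SEG might be represented in Python. The class layout, field names, and example prompt are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch of one semantic error graph (SEG); field names
# and example data are hypothetical, not the dataset's real schema.
from dataclasses import dataclass, field

@dataclass
class ErrorNode:
    error_count: int                # objective number of semantic errors vs. the prompt
    description: str                # which errors this node's images contain
    image_paths: list[str] = field(default_factory=list)

@dataclass
class SemanticErrorGraph:
    prompt: str
    nodes: list[ErrorNode]          # node 0 is faithful; counts grow with deviation

seg = SemanticErrorGraph(
    prompt="a red cube on top of a blue sphere",
    nodes=[
        ErrorNode(0, "faithful to the prompt", ["img_00.png", "img_01.png"]),
        ErrorNode(1, "cube is the wrong color", ["img_10.png", "img_11.png"]),
        ErrorNode(2, "wrong color and wrong shape", ["img_20.png"]),
    ],
)
```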

Meta-Metrics

The cornerstone of our evaluation framework lies in two novel meta-metrics: a ranking-correctness assessment (Ordering) and a Separation assessment. The former leverages Spearman's rank correlation to assess a metric's ability to correctly order images by their semantic deviation from the prompt. The Separation meta-metric employs the two-sample Kolmogorov–Smirnov statistic to evaluate a metric's ability to distinguish between sets of images reflecting distinct semantic errors. Together, these meta-metrics provide a robust measure of a T2I prompt faithfulness metric's performance.
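As an illustration, a minimal sketch of both meta-metrics for a single SEG, computed with scipy, might look as follows. The error counts and metric scores here are hypothetical, and the paper's exact aggregation across node pairs and SEGs may differ.

```python
import numpy as np
from scipy.stats import spearmanr, ks_2samp

# Hypothetical data for one SEG: error_counts[i] is the objective error
# count of the node image i belongs to; scores[i] is the faithfulness
# metric's output for that image.
error_counts = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
scores = np.array([0.92, 0.88, 0.90, 0.75, 0.79, 0.71, 0.55, 0.60, 0.58])

# Ordering: scores should fall as error count rises, so a good metric
# yields a strongly negative Spearman rank correlation.
rho, _ = spearmanr(error_counts, scores)
print(f"Ordering (Spearman rho): {rho:.3f}")

# Separation: the two-sample Kolmogorov-Smirnov statistic measures how
# well the metric's score distributions for two error nodes are separated.
ks_stat, p_value = ks_2samp(scores[error_counts == 0], scores[error_counts == 2])
print(f"Separation (KS statistic): {ks_stat:.3f}, p = {p_value:.3f}")
```

One natural way to aggregate, under these assumptions, is to negate the Spearman correlation (so higher is better) and average the KS statistic over all pairs of error nodes, then average both per-SEG scores across the dataset.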

Experiments

Our experiments span a broad spectrum of T2I faithfulness metrics, evaluating each against the newly proposed TS2 benchmark. The study offers a comparative analysis across metric classes, including embedding-based metrics like CLIPScore and newer vision-language model (VLM)-based metrics such as TIFA and DSG. The results are surprising: simpler feature-based metrics like CLIPScore display competitive performance, especially on challenging error subsets. This observation suggests that feature-based metrics remain a valuable baseline alongside more sophisticated VLM-based approaches.
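For reference, the CLIPScore baseline is simple enough to sketch in a few lines. The version below follows the standard formulation, CLIPScore = 2.5 * max(cos(E_image, E_text), 0), using Hugging Face's CLIP; the checkpoint choice and image filename are assumptions made for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint chosen for illustration; other CLIP variants work the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    """CLIPScore: scaled, clipped cosine similarity of CLIP text/image embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    cosine = (text_emb * image_emb).sum().item()
    return max(0.0, 2.5 * cosine)  # CLIPScore = 2.5 * max(cos, 0)

# Hypothetical image file from an SEG node.
score = clip_score("a red cube on top of a blue sphere", Image.open("img_00.png"))
print(f"CLIPScore: {score:.3f}")
```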

Discussion and Conclusion

The comparative analysis offered by T2IScoreScore yields critical insights into the current state of T2I prompt coherence metric development. Notably, the performance of simpler metrics in the face of complex, naturally occurring model errors highlights a path forward for metric development focused not just on aligning with human judgment but also on objective semantic error identification. Our research emphasizes the necessity of bridging the gap between subjective preference and objective error-based evaluation, advocating for a multifaceted approach to metric development. As the T2I field continues to evolve, TS2 stands as a pivotal benchmark tool, guiding the refinement of evaluation metrics toward more accurate, reliable, and semantically coherent image generation.

Acknowledgements and Impact Statement

The research highlights the indispensable role of precise evaluation tools like T2IScoreScore in advancing T2I technology. By providing an objective benchmark, TS2 enables a deeper understanding and refinement of prompt faithfulness metrics, ensuring their alignment with the semantic content of text prompts. This contributes significantly to the development of more effective and semantically aware T2I models, bolstering the reliability of generated images for a wide array of applications.
