Decomposed evaluations of geographic disparities in text-to-image models

(2406.11988)
Published Jun 17, 2024 in cs.CV, cs.AI, cs.CY, and cs.LG

Abstract

Recent work has identified substantial disparities in generated images of different geographic regions, including stereotypical depictions of everyday objects like houses and cars. However, existing measures for these disparities have been limited to either human evaluations, which are time-consuming and costly, or automatic metrics evaluating full images, which are unable to attribute these disparities to specific parts of the generated images. In this work, we introduce a new set of metrics, Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG), that allows us to separately measure geographic disparities in the depiction of objects and backgrounds in generated images. Using Decomposed-DIG, we audit a widely used latent diffusion model and find that generated images depict objects with better realism than backgrounds and that backgrounds in generated images tend to contain larger regional disparities than objects. We use Decomposed-DIG to pinpoint specific examples of disparities, such as stereotypical background generation in Africa, struggling to generate modern vehicles in Africa, and unrealistically placing some objects in outdoor settings. Informed by our metric, we use a new prompting structure that enables a 52% worst-region improvement and a 20% average improvement in generated background diversity.

Overview

  • The paper introduces Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG), a novel set of metrics for analyzing geographic disparities in text-to-image models by separately evaluating object and background representations.

  • Using the Segment Anything Model and Vision Transformers, Decomposed-DIG measures realism and diversity separately for objects and backgrounds, revealing higher realism in generated objects than in backgrounds and identifying specific geographic biases in image generation.

  • The study explores mitigation strategies through prompt engineering, achieving significant improvements in background diversity, and highlights the importance of this decomposed approach for future research and development aimed at reducing biases in generative AI systems.

Decomposed Evaluations of Geographic Disparities in Text-to-Image Models

The evaluated paper introduces Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG) as a novel set of metrics that scrutinize geographic disparities in text-to-image generation systems. This work builds on existing metrics by disentangling geographic disparities into object and background representations within generated images.

Introduction and Motivation

Recent advances in text-to-image generative models, particularly latent diffusion models (LDMs), have transformed visual content creation. However, closer scrutiny has exposed significant geographic biases in these models, such as overrepresentation of stereotypical imagery and underrepresentation of certain geographic regions. Existing evaluations either rely on human judgment, which is time-consuming and costly, or score full images, so they cannot attribute these disparities to specific components of generated images. This gap motivates a more granular analysis.

Methodology: Decomposed-DIG

The paper proposes Decomposed-DIG, which extends the precision and coverage metrics from earlier works [4] to separately measure disparities in objects and backgrounds. The procedure involves:

  1. Object and Background Segmentation: Utilizes the Segment Anything Model (SAM) in conjunction with GroundingDINO's object detection to achieve precise segmentation.
  2. Feature Extraction: Employs a Vision Transformer (ViT) for feature extraction, isolating object and background features based on segmented regions.
  3. Benchmarking: Computes precision (as a proxy for realism) and coverage (as a proxy for diversity) for object-specific and background-specific segments.
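The last two steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: `pool_by_mask` assumes per-patch embeddings laid out as an (H, W, D) array and uses mean pooling, both of which are hypothetical choices, and `precision_and_coverage` follows the common k-nearest-neighbor formulation of precision and coverage, which may differ in detail from the paper's setup. All function names are made up for this example.

```python
import numpy as np


def pool_by_mask(patch_features, object_mask):
    """Mean-pool patch features inside and outside an object mask.

    patch_features: (H, W, D) array of per-patch embeddings (assumed layout).
    object_mask:    (H, W) boolean array, True where a patch belongs to the object.
    """
    object_mask = np.asarray(object_mask, dtype=bool)
    if object_mask.all() or not object_mask.any():
        raise ValueError("mask must contain both object and background patches")
    object_feature = patch_features[object_mask].mean(axis=0)
    background_feature = patch_features[~object_mask].mean(axis=0)
    return object_feature, background_feature


def pairwise_distances(a, b):
    """Euclidean distances between rows of a (n, D) and rows of b (m, D)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def precision_and_coverage(real, generated, k=3):
    """k-NN precision (realism proxy) and coverage (diversity proxy).

    precision: share of generated samples that fall inside the k-NN ball
               of at least one real sample.
    coverage:  share of real samples whose k-NN ball contains at least
               one generated sample.
    """
    if not 1 <= k < len(real):
        raise ValueError("k must satisfy 1 <= k < number of real samples")
    # After sorting each row, column 0 is the self-distance (0), so
    # column k holds the distance to the k-th nearest other real sample.
    radii = np.sort(pairwise_distances(real, real), axis=1)[:, k]
    inside = pairwise_distances(real, generated) <= radii[:, None]
    precision = inside.any(axis=0).mean()
    coverage = inside.any(axis=1).mean()
    return float(precision), float(coverage)
```

In a decomposed evaluation, `precision_and_coverage` would be called twice per region: once on the pooled object features of real and generated images, and once on the pooled background features.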

Key Findings

Using Decomposed-DIG, the study evaluates a widely used LDM and uncovers nuanced patterns of geographic disparities:

  1. Realism and Diversity: Generated objects are more realistic than generated backgrounds. Object representations are also more consistent across regions, whereas backgrounds show approximately 1.7 times larger disparities between geographic regions.
  2. Specific Disparities: The paper also pinpoints concrete, qualitative examples. Generated backgrounds for Africa are stereotypical and often lack modern infrastructure such as paved streets, the model struggles to generate modern vehicles for Africa, and some objects are placed in unrealistic outdoor settings (e.g., cooking pots outdoors in Europe).
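One simple way to turn per-region scores into a single disparity number is the gap between the best and worst region. The sketch below assumes that definition, which may not match the paper's exact measure. The region names and scores are placeholders chosen for illustration and are not results from the paper.

```python
def regional_gap(scores_by_region):
    """Gap between the best-scoring and worst-scoring region."""
    values = list(scores_by_region.values())
    return max(values) - min(values)


def worst_region(scores_by_region):
    """Name of the region with the lowest score."""
    return min(scores_by_region, key=scores_by_region.get)


# Placeholder scores, NOT taken from the paper.
object_scores = {"Region A": 0.8, "Region B": 0.7}
background_scores = {"Region A": 0.6, "Region B": 0.4}

object_gap = regional_gap(object_scores)
background_gap = regional_gap(background_scores)
gap_ratio = background_gap / object_gap  # how much larger the background gap is
```

Comparing `background_gap` with `object_gap` is the kind of comparison behind a statement such as "backgrounds show larger regional disparities than objects".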

Mitigations via Prompt Engineering

Building on these findings, the authors explore an early mitigation strategy through prompt engineering. By modifying the prompt structure from "{object} in {region}" to "{regional adjective} {object}", the study achieves notable improvements:

  • A 52% improvement in background diversity for the worst-performing region (Africa).
  • A 20% average improvement in background diversity without significantly compromising realism or object diversity.
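The two prompt structures are easy to compare side by side. In the sketch below, the `REGION_ADJECTIVES` mapping and the function names are assumptions made for this example; the paper's exact region list and wording may differ.

```python
# Hypothetical region-to-adjective mapping, used only for illustration.
REGION_ADJECTIVES = {
    "Africa": "African",
    "Europe": "European",
}


def baseline_prompt(obj, region):
    """Original structure: "{object} in {region}"."""
    return f"{obj} in {region}"


def adjective_prompt(obj, region):
    """Modified structure: "{regional adjective} {object}"."""
    return f"{REGION_ADJECTIVES[region]} {obj}"


prompts = [
    (baseline_prompt(obj, region), adjective_prompt(obj, region))
    for obj in ("car", "house")
    for region in REGION_ADJECTIVES
]
```

Each pair in `prompts` can then be fed to the same model so that background diversity is compared between the two structures for identical objects and regions.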

Implications and Future Work

The introduction of Decomposed-DIG offers a more granular and interpretable framework for evaluating geographic disparities in text-to-image models. This work underscores the importance of distinguishing between object and background disparities, thus making subsequent bias identification and mitigation strategies more effective.

Future developments could focus on extending this methodology to other aspects of image generation, such as color biases or temporal disparities. Moreover, incorporating Decomposed-DIG into the training process could directly influence model development, promoting fairness and inclusivity in generative AI systems.

By isolating object and background biases, the paper facilitates a deeper understanding of where and why disparities arise, guiding the development of more geographically equitable text-to-image models.

Conclusion

The paper presents a significant step forward in the nuanced evaluation of geographic disparities in text-to-image models. Decomposed-DIG not only clarifies the origins of these disparities but also provides a solid foundation for developing targeted mitigation strategies. This methodology invites further research into decomposed evaluations, potentially extending beyond geographic disparities to other socio-demographic factors, thus broadening the scope of equitable AI.
