Beyond Aesthetics: Cultural Competence in Text-to-Image Models (2407.06863v6)

Published 9 Jul 2024 in cs.CV

Abstract: Text-to-Image (T2I) models are being increasingly adopted in diverse global communities where they create visual representations of their unique cultures. Current T2I benchmarks primarily focus on faithfulness, aesthetics, and realism of generated images, overlooking the critical dimension of cultural competence. In this work, we introduce a framework to evaluate cultural competence of T2I models along two crucial dimensions: cultural awareness and cultural diversity, and present a scalable approach using a combination of structured knowledge bases and LLMs to build a large dataset of cultural artifacts to enable this evaluation. In particular, we apply this approach to build CUBE (CUltural BEnchmark for Text-to-Image models), a first-of-its-kind benchmark to evaluate cultural competence of T2I models. CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions and along 3 concepts: cuisine, landmarks, and art. CUBE consists of 1) CUBE-1K, a set of high-quality prompts that enable the evaluation of cultural awareness, and 2) CUBE-CSpace, a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity. We also introduce cultural diversity as a novel T2I evaluation component, leveraging quality-weighted Vendi score. Our evaluations reveal significant gaps in the cultural awareness of existing models across countries and provide valuable insights into the cultural diversity of T2I outputs for under-specified prompts. Our methodology is extendable to other cultural regions and concepts, and can facilitate the development of T2I models that better cater to the global population.

Summary

The paper presents CUBE, a benchmark that assesses cultural competence in T2I models using a gold-standard set of 1,000 prompts (CUBE-1K) and a collection of roughly 300K cultural artifacts (CUBE-CSpace).
The methodology combines structured knowledge bases, Large Language Models, and human ratings to evaluate cultural relevance across cuisine, landmarks, and art.
Results reveal significant gaps in cultural awareness and diversity for models such as Imagen 2 and Stable Diffusion XL, pointing to the need for more culturally inclusive training.

Rapid advances in text-to-image (T2I) models have changed how images are produced in creative domains such as digital art, advertising, and education. However, evaluating these models solely on photorealism, faithfulness, and aesthetics says little about their cultural competence. This paper addresses that gap by introducing a benchmark, CUBE (CUltural BEnchmark for Text-to-Image models), that evaluates T2I models on two dimensions: cultural awareness and cultural diversity.

Contributions and Methodology

The key contributions of this work lie in the development of CUBE, which comprises two main components: CUBE-1K and CUBE-CSpace. The former is a gold-standard dataset of 1000 prompts crafted to evaluate the cultural awareness of T2I models, while the latter is an extensive resource containing approximately 300K cultural artifacts used for grounding and analyzing cultural diversity.

Cultural Awareness: Cultural awareness is evaluated using CUBE-1K, which covers three concepts: cuisine, landmarks, and art. The methodology uses structured knowledge bases (KBs) such as Wikidata to extract a large set of cultural artifacts. LLMs such as GPT-4-Turbo then filter and complete the extracted collection so that it includes diverse and relevant artifacts. Human annotators from the relevant cultures rated the generated images on cultural relevance, faithfulness, and realism. These ratings reveal notable gaps in the ability of existing T2I models to represent diverse cultural artifacts accurately and realistically.

Cultural Diversity: The paper introduces cultural diversity (CD) as a novel metric for evaluating T2I models. This metric leverages the quality-weighted Vendi score, which balances the diversity of artifacts with their generation quality. Various similarity kernels are defined to capture different facets of geo-cultural diversity, including continent-level, country-level, and artifact-level similarities.
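To make the metric concrete, here is a minimal sketch of a quality-weighted Vendi score in Python with NumPy. It assumes the usual definitions: the Vendi score is the exponential of the Shannon entropy of the eigenvalues of K/n for an n-by-n similarity kernel K with unit diagonal, and the quality-weighted variant multiplies that score by the mean per-sample quality. The function names, the label-equality kernel, and the toy quality values are illustrative assumptions, not the kernels or scores used in the paper.

```python
import numpy as np


def vendi_score(K):
    """Vendi score: exp of the Shannon entropy of the eigenvalues of K / n.

    K is assumed to be a symmetric positive semi-definite similarity
    matrix with ones on the diagonal.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop numerically-zero eigenvalues so that 0 * log(0) is treated as 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


def quality_weighted_vendi_score(K, quality):
    """Mean per-sample quality multiplied by the Vendi score."""
    return float(np.mean(quality)) * vendi_score(K)


def label_kernel(labels):
    """Hypothetical kernel: similarity 1 if two samples share a label, else 0.

    The labels could be continents, countries, or artifact names, which
    gives one simple way to look at diversity at each of those levels.
    """
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


if __name__ == "__main__":
    # Toy example: four generated images, two continents represented.
    continents = ["Asia", "Asia", "Europe", "Europe"]
    quality = [1.0, 1.0, 0.5, 0.5]  # made-up quality scores in [0, 1]
    K = label_kernel(continents)
    print(vendi_score(K))
    print(quality_weighted_vendi_score(K, quality))
```

With the label-equality kernel, the Vendi score reduces to the effective number of distinct labels, so the toy example gives a score of about 2 for two equally represented continents. The quality weighting then scales it by the mean quality of 0.75, so a model cannot score well by producing diverse but low-quality images.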

Numerical Results and Implications

The human evaluation results show significant disparities in the cultural competence of T2I models across countries and cultural concepts. Both Imagen 2 and Stable Diffusion XL leave substantial room for improvement, especially in representing artifacts from the Global South. The models were frequently biased towards well-represented, popular countries, with lower cultural awareness and diversity scores for countries such as Brazil, Turkey, and Nigeria.

The quantitative results for cultural diversity, evaluated using a wide range of prompts and seeds, revealed that none of the models performed exceptionally well. Even the best models exhibited relatively low diversity scores, indicating a lack of comprehensive geo-cultural representation. This underscores the need for explicit prioritization of cultural diversity in the development and training of T2I models.

Future Directions

The findings suggest that cultural competence should be treated as a core objective in the training and evaluation of T2I models. This involves not just expanding datasets to include more culturally diverse inputs but also developing metrics that accurately capture the nuances of cultural representation.

Practical Implications: For practitioners, the CUBE benchmark provides a valuable tool to evaluate and improve the cultural competence of T2I models. Integrating such evaluations into the model development lifecycle can ensure that the models serve a truly global audience, mitigating the risks of cultural misrepresentation and bias.

Theoretical Implications: From a theoretical standpoint, the introduction of novel metrics to evaluate cultural diversity contributes to the broader understanding of what constitutes fairness and inclusivity in AI models. Future research can build on this work by exploring more granular cultural definitions and incorporating additional dimensions such as sub-cultures and co-cultures.

Conclusion

This paper identifies gaps in the cultural competence of T2I models and offers tools to measure them. The CUBE benchmark, together with cultural awareness and cultural diversity as evaluation dimensions, is a step towards more inclusive and globally representative T2I systems. Integrating these benchmarks and metrics into standard model evaluation frameworks would help future models better serve the full range of human cultures.