NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models (2404.12464v7)
Abstract: To be effectively and safely deployed to global user populations, LLMs must adapt their outputs to user values and cultures, not just know about them. We introduce NormAd, an evaluation framework for assessing LLMs' cultural adaptability, specifically measuring their ability to judge social acceptability across different levels of cultural norm specificity, from abstract values to explicit social norms. As an instantiation of our framework, we create NormAd-Eti, a benchmark of 2.6k situational descriptions representing social-etiquette-related cultural norms from 75 countries. Through comprehensive experiments on NormAd-Eti, we find that LLMs struggle to accurately judge social acceptability across these varying degrees of cultural context and show stronger adaptability to English-centric cultures than to those from the Global South. Even in the simplest setting, where the relevant social norm is provided explicitly, our best models' performance (<82%) lags behind human performance (>95%). In settings with only abstract values or country information, model performance drops substantially (<60%), while human accuracy remains high (>90%). Furthermore, we find that models are better at recognizing socially acceptable situations than unacceptable ones. Our findings highlight current pitfalls in the socio-cultural reasoning of LLMs that hinder their adaptability for global audiences.
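To make the evaluation setup described above concrete, the sketch below shows one way a NormAd-style benchmark could be scored: each situation is paired with cultural context at increasing specificity (country only, abstract value, explicit norm), the model is asked whether the behavior is socially acceptable, and accuracy is computed per context level. This is a minimal illustrative sketch, not the authors' released code; the field names (`country`, `value`, `rule_of_thumb`, `story`, `gold_label`), the label set, and the `model_fn` interface are assumptions.

```python
# Hedged sketch of a NormAd-style evaluation loop (not the authors' released code).
# Field names and the model_fn interface are illustrative assumptions about how
# such a benchmark could be scored.
from collections import defaultdict

CONTEXT_LEVELS = ["country", "value", "rule_of_thumb"]  # least -> most specific
LABELS = {"yes", "no", "neither"}

def build_prompt(example: dict, level: str) -> str:
    """Prepend cultural context of the chosen specificity to the situation."""
    if level == "country":
        context = f"You are in {example['country']}."
    elif level == "value":
        context = f"You are in a culture that values: {example['value']}."
    else:  # explicit social norm
        context = f"Relevant social norm: {example['rule_of_thumb']}."
    return (
        f"{context}\n\nSituation: {example['story']}\n\n"
        "Is the behavior in this situation socially acceptable? "
        "Answer with exactly one of: yes, no, neither."
    )

def evaluate(model_fn, dataset: list[dict]) -> dict[str, float]:
    """model_fn(prompt) -> raw text; returns accuracy per context level."""
    correct, total = defaultdict(int), defaultdict(int)
    for example in dataset:
        for level in CONTEXT_LEVELS:
            answer = model_fn(build_prompt(example, level)).strip().lower()
            prediction = answer if answer in LABELS else "neither"
            correct[level] += int(prediction == example["gold_label"])
            total[level] += 1
    return {level: correct[level] / total[level] for level in CONTEXT_LEVELS}
```

Reporting accuracy separately per context level, as in this sketch, is what exposes the gap the abstract describes between the explicit-norm setting and the abstract-value or country-only settings.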
Authors: Abhinav Rao, Akhila Yerukola, Vishwa Shah, Katharina Reinecke, Maarten Sap