CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias (2308.12539v3)
Abstract: As language models (LMs) become increasingly powerful and widely used, it is important to quantify their potential for sociodemographic bias and harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, owing to factors such as low template diversity or a limited number of templates. Moreover, most previous work considers only a single NLP task. We introduce the Comprehensive Assessment of Language Model bias (CALM) for robust measurement of two universally relevant types of sociodemographic bias: gender and race. CALM integrates sixteen datasets spanning question answering, sentiment analysis, and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., in length and vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random selection of template subsets. We apply CALM to 20 LLMs and find that, for two LLM series, larger models tend to be more biased than smaller ones. The T0 series is the least biased model family of the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.