An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models

Published 22 Jan 2023 in cs.CL and cs.AI | (2301.09211v1)

Abstract: Large-scale Pre-Trained Language Models (PTLMs) capture knowledge from massive human-written data which contains latent societal biases and toxic contents. In this paper, we leverage the primary task of PTLMs, i.e., language modeling, and propose a new metric to quantify manifested implicit representational harms in PTLMs towards 13 marginalized demographics. Using this metric, we conducted an empirical analysis of 24 widely used PTLMs. Our analysis provides insights into the correlation between the proposed metric in this work and other related metrics for representational harm. We observe that our metric correlates with most of the gender-specific metrics in the literature. Through extensive experiments, we explore the connections between PTLM architectures and representational harms across two dimensions: depth and width of the networks. We found that prioritizing depth over width mitigates representational harms in some PTLMs. Our code and data can be found at https://github.com/microsoft/SafeNLP.

Summary

The paper introduces a novel metric that quantifies implicit representational harms using language modeling likelihood comparisons.
It applies a two-stage social science approach to define demographics and operationalize bias measurements across 24 pre-trained language models.
Findings suggest that deeper architectures may mitigate biases, emphasizing the need for strategic model design and fairness metrics in AI.

An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models

This paper investigates representational harms in large-scale Pre-Trained Language Models (PTLMs), focusing on the biases these models may harbor against marginalized groups. Given the pervasive adoption of PTLMs in natural language processing tasks, it is crucial to understand and mitigate any societal biases they might perpetuate. The study introduces a novel metric that quantifies implicit representational harms toward 13 marginalized demographic groups. Using this metric, the authors conduct an empirical analysis of 24 widely used PTLMs.

Key Contributions

The authors make two primary contributions. First, they offer a clear conceptualization of representational harms toward marginalized groups and introduce a metric to quantify these phenomena within PTLMs. The measurement model utilized here adheres to methodologies from the social sciences, adopting a two-stage approach: conceptualization and operationalization. Conceptualization involves defining the target demographics and representational harms, while operationalization assesses these harms using a language modeling-based likelihood comparison of harmful versus benign statements.

Second, the paper presents an empirical evaluation of representational harms in PTLMs, analyzing how network architecture elements such as depth and width influence these biases. Notably, the study finds that prioritizing network depth over width can sometimes mitigate these harms.

Methodology

The metric builds on the language modeling objective. It uses perplexity (or pseudo-perplexity for auto-encoder models) to gauge the likelihood that a model assigns to implicitly harmful versus benign statements. The evaluation dataset is a subset of the ToxiGen dataset, annotated to distinguish harmful from benign content about 13 marginalized groups.
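The perplexity comparison described above can be illustrated with a minimal sketch. It assumes the standard definition of perplexity as the exponential of the negative mean token log-probability; the function name and the toy log-probabilities are hypothetical and are not taken from the paper's code. For pseudo-perplexity the same formula applies, but each token's log-probability is obtained by masking that token and scoring it given the rest of the sentence.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sentence from per-token natural-log probabilities.

    For an autoregressive model, each entry is log p(token | preceding tokens).
    For a masked (auto-encoder) model, each entry is log p(token | all other
    tokens), which yields pseudo-perplexity under the same formula.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


# Toy values (assumed for illustration, not real model outputs):
# a sentence whose tokens each get probability 0.5 versus one with 0.25.
likely_sentence = [math.log(0.5)] * 4
unlikely_sentence = [math.log(0.25)] * 4

# Lower perplexity means the model finds the sentence more likely.
print(perplexity(likely_sentence) < perplexity(unlikely_sentence))
```

In practice the log-probabilities would come from the model under evaluation; the sketch only shows how they are aggregated into a single per-sentence score.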

The authors use the Mann-Whitney U-test to quantify the likelihood disparity between harmful and benign statements and derive a 'safety score' from it. The score summarizes how consistently a model assigns higher likelihood to benign sentences than to harmful ones, so a higher score indicates a safer model.
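One way to picture a Mann-Whitney-based score is the sketch below. It assumes the score is the U statistic normalized by the number of harmful/benign pairs, that is, the fraction of pairs in which the harmful sentence has the higher perplexity, with ties counted as half. This normalization and the function name `safety_score` are assumptions made for illustration; the paper's exact formulation may differ.

```python
def safety_score(harmful_ppl, benign_ppl):
    """Normalized Mann-Whitney U statistic over perplexity scores.

    Returns the fraction of (harmful, benign) pairs where the harmful
    sentence has strictly higher perplexity than the benign one; tied
    pairs contribute 0.5. The result lies in [0, 1], and higher values
    mean the model tends to find benign sentences more likely.
    """
    if not harmful_ppl or not benign_ppl:
        raise ValueError("both groups need at least one score")
    u = 0.0
    for h in harmful_ppl:
        for b in benign_ppl:
            if h > b:
                u += 1.0
            elif h == b:
                u += 0.5
    return u / (len(harmful_ppl) * len(benign_ppl))


# Toy perplexities (assumed values, not results from the paper).
print(safety_score([30.0, 45.0, 60.0], [10.0, 20.0, 50.0]))
```

The pairwise loop is quadratic in the number of sentences, which is fine for a sketch; a rank-based computation of the same statistic would scale better on a full evaluation set.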

Results and Implications

The safety scores indicate that PTLMs are prone to manifesting considerable representational harms, and that the degree of harm differs across marginalized demographics. Variations in safety scores across models also suggest that a PTLM's architecture influences its bias level. Specifically, in some PTLM families, deeper models exhibit less representational harm than wider ones.

The implications of this research are multifaceted. Practically, the findings highlight the need for careful architectural considerations in model development and suggest that future architectural innovations should consider bias mitigation as a core component. Theoretically, the work reinforces the necessity of diverse, interdisciplinary approaches to understanding and mitigating biases in AI systems. This includes integrating insights from social sciences to refine metrics and model fairness.

The results indicate that the intrinsic and extrinsic metrics currently used for bias assessment capture different aspects of representational harms and can surface previously unknown biases, which points to a gap in existing evaluation frameworks that this paper begins to bridge. The authors advocate for a broader set of bias metrics and datasets, and for systematic evaluations that help align technical improvements with ethical AI commitments.

Future Directions

Future research could extend the analysis to intersectional demographics, investigating bias toward combined marginalized groups such as Middle Eastern women, to provide a more holistic evaluation. Furthermore, using the safety score as an objective function when training PTLMs is a promising avenue for developing more equitable models.

Overall, this paper contributes to the ongoing discourse on social biases in AI, advocating for more comprehensive strategies to ensure fairness and equity in the development and deployment of language models.
