Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

(2406.00020)
Published May 23, 2024 in cs.CL and cs.CY

Abstract

Content moderation on social media platforms shapes the dynamics of online discourse, influencing whose voices are amplified and whose are suppressed. Recent studies have raised concerns about the fairness of content moderation practices, particularly for aggressively flagging posts from transgender and non-binary individuals as toxic. In this study, we investigate the presence of bias in harmful speech classification of gender-queer dialect online, focusing specifically on the treatment of reclaimed slurs. We introduce a novel dataset, QueerReclaimLex, based on 109 curated templates exemplifying non-derogatory uses of LGBTQ+ slurs. Dataset instances are scored by gender-queer annotators for potential harm depending on additional context about speaker identity. We systematically evaluate the performance of five off-the-shelf language models in assessing the harm of these texts and explore the effectiveness of chain-of-thought prompting to teach LLMs to leverage author identity context. We reveal a tendency for these models to inaccurately flag texts authored by gender-queer individuals as harmful. Strikingly, across all LLMs the performance is poorest for texts that show signs of being written by individuals targeted by the featured slur (F1 <= 0.24). We highlight an urgent need for fairness and inclusivity in content moderation systems. By uncovering these biases, this work aims to inform the development of more equitable content moderation practices and contribute to the creation of inclusive online spaces for all users.

Figure: Tweet templates from gender-queer authors, showing the original text, the marked slur position, and the inserted reclaimed slur.

Overview

  • Dorn et al. examine the biases of language models in detecting harmful speech in gender-queer dialects, using QueerReclaimLex, a dataset built around non-derogatory uses of LGBTQ+ slurs.

  • The study evaluates five off-the-shelf language models (Detoxify, Perspective, GPT-3.5, LLaMA 2, and Mistral) under several prompting schemas and reveals high false positive rates, particularly for non-derogatory slur uses by in-group members.

  • Key findings indicate that current language models struggle to contextualize slur usage accurately, producing frequent false positives, and point to the need for algorithms that better capture linguistic and contextual nuance in gender-queer speech.

Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

In the paper titled "Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias," Dorn et al. scrutinize the performance of language models in classifying harmful speech within gender-queer dialects. The study, built around a novel dataset called QueerReclaimLex, highlights potential biases in content moderation systems.

Dataset and Methodology

QueerReclaimLex is central to this study, created to probe the biases of language models toward non-derogatory uses of LGBTQ+ slurs. The dataset is derived from NB-TwitCorpus3M, which contains approximately 3 million tweets from users with non-binary pronouns listed in their biographies. Each instance is formed by substituting slurs into curated templates drawn from real tweets authored by non-binary individuals, yielding a collection of posts exemplifying linguistic reclamation of derogatory terms. Annotators, all identifying as gender-queer, labeled each instance for harm under two assumptions about the author: that they belong to the group targeted by the slur (in-group) or that they do not (out-group).
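
This construction can be pictured with a short sketch. The Python snippet below is a minimal, hypothetical illustration of the template-substitution step; the `[SLUR]` placeholder format, the template strings, and the field names are assumptions for demonstration, not the authors' released code.

```python
# Minimal sketch of template-based instance construction (hypothetical format).
from dataclasses import dataclass
from itertools import product
from typing import Optional

SLURS = ["queer", "tranny"]  # illustrative subset of the slurs studied
TEMPLATES = [
    # Hypothetical templates with a [SLUR] placeholder; the real templates are
    # curated from tweets by non-binary authors in NB-TwitCorpus3M.
    "proud to be a [SLUR] and done apologizing for it",
    "people keep calling me a [SLUR] like that's an insult",
]

@dataclass
class Instance:
    text: str
    slur: str
    harm_if_ingroup: Optional[float] = None   # annotator label, in-group author assumed
    harm_if_outgroup: Optional[float] = None  # annotator label, out-group author assumed

def build_instances(templates, slurs):
    """Substitute every slur into every curated template."""
    return [
        Instance(text=template.replace("[SLUR]", slur), slur=slur)
        for template, slur in product(templates, slurs)
    ]

instances = build_instances(TEMPLATES, SLURS)
print(len(instances), "instances;", instances[0].text)
```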

The paper evaluates five off-the-shelf language models: Detoxify and Perspective (toxicity classifiers), and GPT-3.5, LLaMA 2, and Mistral (LLMs). For the LLMs, three prompting schemas are tested: vanilla (zero-shot, with no author information), identity (the author's in-group/out-group status is stated), and identity-cot (identity context combined with chain-of-thought prompting).
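
The three schemas can be sketched as simple prompt builders. The wording below is an assumption for illustration only; the paper's exact prompts are not reproduced here.

```python
# Illustrative prompt builders for the three schemas (wording is assumed, not the paper's).

def vanilla_prompt(post: str) -> str:
    # Zero-shot: no information about the author.
    return f"Is the following social media post harmful? Answer YES or NO.\n\nPost: {post}"

def identity_prompt(post: str, in_group: bool) -> str:
    # Adds the author's in-group/out-group status relative to the slur's target group.
    status = "is" if in_group else "is not"
    return (
        f"The author {status} a member of the group targeted by any slur in the post.\n"
        f"Is the post harmful? Answer YES or NO.\n\nPost: {post}"
    )

def identity_cot_prompt(post: str, in_group: bool) -> str:
    # Identity context plus chain-of-thought: ask the model to reason before answering.
    status = "is" if in_group else "is not"
    return (
        f"The author {status} a member of the group targeted by any slur in the post.\n"
        "First reason step by step about whether the slur is used in a reclaimed,\n"
        "non-derogatory way given the author's identity, then answer YES or NO.\n\n"
        f"Post: {post}"
    )
```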

Key Findings

Annotator Agreement and Harm Assessment

Annotator agreement, measured by Cohen's kappa, was notably higher for posts framed as authored by in-group members than by out-group members (0.80 versus 0.60). In-group posts were labeled harmful in 15.5% of instances, compared with 82.4% for out-group posts, a sharp difference in harm judgments depending on whether the author belongs to the targeted group. Specific forms of slur use, such as 'Group Label' and 'Sarcasm,' were more likely to be judged harmful when used by in-group members, suggesting intra-group derogation. Conversely, slur uses embedded in quotes or discussions, particularly those concerning identity, were less likely to be deemed harmful when used by out-group members.
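
For reference, Cohen's kappa corrects raw agreement for agreement expected by chance. The snippet below shows how such a score is typically computed (here with scikit-learn), using made-up labels rather than the paper's annotations.

```python
# Agreement computation illustrated with made-up labels (not the paper's data).
# kappa = (p_observed - p_expected) / (1 - p_expected)
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 0, 1, 1, 0, 1, 0]  # 1 = harmful, 0 = not harmful
annotator_b = [1, 0, 0, 1, 0, 0, 1, 0]

print(round(cohen_kappa_score(annotator_a, annotator_b), 2))
```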

Language Models and Speaker Identity

Toxicity classifiers like Detoxify and Perspective produced substantial false positive rates on gender-queer texts authored by in-group members, with F1 scores not exceeding 0.25. This suggests that these classifiers over-rely on the mere presence of slurs rather than on contextual cues, inadvertently marginalizing gender-queer voices.
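
As a point of reference for how such classifiers are typically applied, the sketch below scores a post with the open-source Detoxify package and thresholds the toxicity probability. The 0.5 cutoff and the example post are assumptions for illustration, not the paper's settings; Perspective is accessed through its own web API and is not shown here.

```python
# Scoring a post with Detoxify and thresholding its toxicity probability.
# The 0.5 cutoff is an illustrative assumption, not the paper's setting.
from detoxify import Detoxify

model = Detoxify("original")  # pretrained toxicity classifier
post = "proud to be queer and done apologizing for it"  # example reclaimed use

scores = model.predict(post)  # dict of probabilities, e.g. scores["toxicity"]
flagged = scores["toxicity"] >= 0.5
print(f"toxicity={scores['toxicity']:.2f}, flagged={flagged}")
```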

The study revealed that LLMs also struggle to identify non-derogatory uses of slurs, reflected in poor F1 scores for the vanilla schema (F1 ≤ 0.36). Although adding in-group/out-group identity context improved model performance slightly, false positive rates for in-group speech remained notably high. Chain-of-thought reasoning via identity-cot prompting yielded some further improvement, yet the models consistently failed to achieve satisfactory precision.

An additional analysis of the subset of posts containing clear contextual indicators of in-group membership showed that model performance remained very poor (F1 ≤ 0.24). This indicates that the LLMs do not sufficiently leverage context to assess the harm of slur use within gender-queer dialects accurately.
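
To make the reported metrics concrete, the following sketch shows how F1 and the false positive rate would be computed for binary harm predictions on an in-group subset, using scikit-learn and made-up labels rather than the paper's data.

```python
# Computing F1 and false positive rate for binary harm predictions (toy labels).
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 0, 1, 0, 0, 1, 0]  # gold harm labels for in-group-authored posts
y_pred = [1, 1, 0, 1, 1, 0, 0, 1]  # model flags: many benign posts flagged harmful

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_positive_rate = fp / (fp + tn)

print(f"F1 = {f1_score(y_true, y_pred):.2f}, FPR = {false_positive_rate:.2f}")
```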

Slur-Specific Model Behavior

The study also highlighted that models assign different levels of harm to different slurs. For example, 'fag,' 'shemale,' and 'tranny' were consistently scored as more harmful across models. This dependence of harm scores on the specific slur decreased when identity context and chain-of-thought prompting were employed, suggesting that additional context can partially mitigate spurious correlations between individual slurs and predicted harm.
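
An analysis of this kind amounts to grouping model harm scores by the inserted slur, as in the short pandas sketch below (the scores are made up for illustration, not the paper's results).

```python
# Grouping model harm scores by the inserted slur (made-up scores, not the paper's).
import pandas as pd

df = pd.DataFrame({
    "slur":  ["queer", "queer", "tranny", "tranny", "fag", "fag"],
    "score": [0.21, 0.34, 0.78, 0.81, 0.88, 0.92],  # illustrative model harm scores
})

# Mean harm score per slur shows how strongly predictions track the slur itself.
print(df.groupby("slur")["score"].mean().sort_values(ascending=False))
```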

Implications and Future Research

This work underscores the need for refining content moderation algorithms to fairly represent and protect gender-queer communities. The significant false positive rates in the detection of harmful speech by both toxicity classifiers and LLMs signal an urgent need for models that incorporate nuanced linguistic and contextual cues. This could entail the development of datasets rich in in-group speech or improved models that can dynamically integrate identity context.

Future research could expand upon this work by examining a more diverse range of marginalized communities or by training models to align more closely with the linguistic norms of these groups. Additionally, supplementing language models with annotated data from a wider range of demographics could help calibrate more equitable moderation practices.

The findings presented by Dorn et al. are crucial for informing the development of more inclusive AI systems, enhancing fairness in digital discourse, and ensuring the responsible deployment of technology in online social spaces.
