RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?

Published 22 Apr 2024 in cs.CL, cs.CY, and cs.LG | (2404.14397v2)

Abstract: LLMs and small LLMs (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when scoring holistically the toxicity of a prompt; and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microaggressions, bias). We release this dataset to contribute to further reduce harmful uses of these models and improve their safe deployment.

Summary

  • The paper introduces RTP-LX, a dataset of toxic prompts in 28 languages enriched with professional transcreation and human annotation for cross-lingual comparability.
  • It evaluates seven S/LLMs using novel metrics and reveals that models excel in overt toxicity detection but struggle with subtle harms like microaggressions, bias, and identity attacks.
  • The study highlights risks in deploying unadapted S/LLMs, urging the need for targeted finetuning and richer, culturally nuanced evaluation protocols.

RTP-LX: Evaluating Multilingual LLM Toxicity Detection

Introduction

The paper "RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?" (2404.14397) addresses the scalability of safety measures for large and small LLMs (S/LLMs) in multilingual contexts. It introduces RTP-LX, a professionally transcreated and human-annotated dataset comprising toxic prompts and outputs in 28 languages, specifically constructed to capture culturally-specific toxicity. The study systematically analyzes seven S/LLMs and evaluates their performance in detecting toxicity using new metrics in diverse, context-rich linguistic scenarios.

Dataset Construction and Annotation Process

RTP-LX is seeded from the RTP corpus—originally containing US-centric, English-language toxic prompts mined from Reddit—and enriched via professional transcreation and human annotation. Two partitions form the corpus: (i) transcreated prompts for cross-lingual comparability, and (ii) manually constructed, culturally-specific prompts to target subtle toxicity not easily exposed by machine translation.
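
As a concrete, hypothetical illustration of the corpus structure described above, the sketch below models a single RTP-LX entry as a Python dataclass. The field names and layout are assumptions made for illustration only, not the released file schema.

```python
# Hypothetical record layout for one RTP-LX entry. Field names are illustrative
# assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field

@dataclass
class RTPLXEntry:
    language: str                   # one of the 28 languages
    partition: str                  # "transcreated" or "manual" (culturally specific)
    prompt: str                     # toxic prompt text
    completion: str | None = None   # optional output being judged alongside the prompt
    harm_scores: dict[str, int] = field(default_factory=dict)
    # e.g. {"toxicity": <1-5 Likert>, "bias": <1-3>, "microaggression": <1-3>, ...}
```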

Annotation relies on eight harm categories: bias, identity attack, insult, microaggression, self-harm, sexual content, toxicity, and violence. Toxicity scores use a five-point Likert scale, whereas all other categories use a three-point scale, reflecting the ordinal nature of harm severity. Weighted Cohen's $\kappa_w$ assesses inter-annotator agreement (IAA), accounting for both class imbalance and the ordinal structure of the labels.
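
The agreement metric can be sketched with scikit-learn's cohen_kappa_score, which supports ordinal weighting. The example below is a minimal, self-contained illustration with made-up annotations; the choice of linear weights here is an assumption, not necessarily the paper's exact weighting scheme.

```python
# Minimal sketch of weighted Cohen's kappa for ordinal harm labels.
# The label values below are made up for illustration, not taken from RTP-LX.
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations: toxicity on a 1-5 Likert scale for ten prompts.
annotator_a = [1, 1, 2, 5, 3, 1, 4, 2, 1, 5]
annotator_b = [1, 2, 2, 5, 3, 1, 5, 2, 1, 4]

# Linear weighting penalizes a 1-vs-5 disagreement more than a 4-vs-5 one,
# reflecting the ordinal nature of harm severity.
kappa_w = cohen_kappa_score(annotator_a, annotator_b, weights="linear")
print(f"weighted Cohen's kappa: {kappa_w:.2f}")
```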

Baseline Analysis: Lexical Toxicity via Block Lists

The baseline, using the FLORES Toxicity-200 block list, assessed explicit toxicity rates across partitions and languages.

Figure 1: Exact-match block rates computed with the FLORES Toxicity-200 list across 28 languages and RTP-LX partitions; manual prompts show a markedly lower block rate, underscoring the prevalence of subtle toxicity.

The average exact-match block rate was $43.4\pm0.1\%$, suggesting a substantial presence of lexically toxic content, but the manually constructed prompts exhibited a roughly 27% lower block rate. This implies that lexical block lists only partially capture the toxic potential, missing nuanced, context-dependent harm, especially in culturally specific scenarios.
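
A minimal sketch of how such an exact-match block rate might be computed, assuming the block list is available as a plain per-language word list and using simple tokenization; file names and preprocessing here are illustrative, not the paper's exact pipeline.

```python
# Illustrative exact-match block-rate computation against a per-language block list.
# File paths, tokenization, and data layout are assumptions, not the paper's pipeline.
import re

def load_block_list(path: str) -> set[str]:
    """Read one lowercased blocked term per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_blocked(text: str, block_list: set[str]) -> bool:
    """True if any word-level token matches a blocked term exactly."""
    tokens = re.findall(r"\w+", text.lower())
    return any(tok in block_list for tok in tokens)

def block_rate(prompts: list[str], block_list: set[str]) -> float:
    """Fraction of prompts containing at least one blocked term."""
    return sum(is_blocked(p, block_list) for p in prompts) / max(len(prompts), 1)

# Hypothetical usage for one language partition:
# block_list = load_block_list("toxicity_200_es.txt")
# print(f"block rate: {block_rate(spanish_prompts, block_list):.1%}")
```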

S/LLM Model Evaluation: Accuracy and Agreement

Seven S/LLMs, spanning open and closed source models (GPT-4 Turbo, Gemma 7B/2B, Mistral, Llama-2 7B/13B, Llama Guard), were evaluated against RTP-LX annotations. GPT-4 Turbo and Gemma 7B demonstrated the highest accuracy in label matching, while Gemma 2B and Llama-2 7B underperformed. The Azure Content Safety service (ACS) surpassed the S/LLMs on raw accuracy but covered only half of the harm categories.

Figure 2: Main evaluation results, displaying accuracy and Cohen's $\kappa_w$ across models; GPT-4 Turbo leads in accuracy, but agreement with annotators ($\kappa_w$) reveals class-imbalance issues.

However, examining weighted Cohen's $\kappa_w$ revealed that the deceptively high accuracy commonly stemmed from class imbalance: "lazy learners" could attain reasonable accuracy simply by outputting the modal label. This was especially evident in models prone to binary output.
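
The interaction between accuracy and $\kappa_w$ under class imbalance can be illustrated with a toy example: a predictor that always emits the modal label reaches high accuracy but zero weighted kappa. The data below is made up for illustration and is not the paper's.

```python
# Toy illustration of why accuracy can look acceptable while weighted Cohen's
# kappa stays at zero: a "lazy" predictor that always emits the modal label
# on an imbalanced label distribution.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical human labels on a 1-5 scale, heavily skewed toward 1 ("no harm").
human = [1] * 80 + [2] * 10 + [3] * 5 + [4] * 3 + [5] * 2
lazy_model = [1] * len(human)  # always predicts the modal label

print("accuracy:", accuracy_score(human, lazy_model))                        # 0.8
print("kappa_w :", cohen_kappa_score(human, lazy_model, weights="linear"))   # 0.0
```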

Harm Category Breakdown: Model vs Human Agreement

Figure 3: Weighted Cohen's $\kappa_w$ per harm category for all models, highlighting proficiency in violent and sexual content detection but failures on microaggressions, bias, and identity attacks.

S/LLMs were successful in detecting explicit violent, sexual, and insulting content but consistently failed to capture subtle forms of harm—including microaggressions, bias, and identity attacks. Both the overall and per-category agreement indicate S/LLMs’ inadequacy for nuanced, context-sensitive toxicity evaluation.
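
A per-category breakdown of this kind can be sketched as follows, assuming model and human labels are stored as flat per-category lists; the data layout and the linear weighting are assumptions, not the paper's exact protocol.

```python
# Sketch of a per-category agreement breakdown between one model and the human
# consensus. Category names follow the paper; the data layout is an assumption.
from sklearn.metrics import cohen_kappa_score

CATEGORIES = ["bias", "identity attack", "insult", "microaggression",
              "self-harm", "sexual content", "toxicity", "violence"]

def per_category_kappa(model_labels: dict, human_labels: dict) -> dict:
    """Weighted Cohen's kappa per harm category."""
    return {
        cat: cohen_kappa_score(human_labels[cat], model_labels[cat], weights="linear")
        for cat in CATEGORIES
    }
```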

Model-specific Agreement and Pathologies

Detailed breakdowns of IAA show that GPT-4 Turbo and ACS perform closest to the human annotators ($\kappa_w$ differences of 0.13 and 0.23, respectively), whereas Gemma 2B and the Llama variants show substantial deficiencies, especially for bias, identity attack, and microaggression.

Figure 4: Cohen's $\kappa_w$ comparison for human annotators, ACS, and GPT-4 Turbo; ACS and GPT-4 Turbo approach human agreement but exhibit deficits in subtle harm categories.

Figure 5: Cohen's $\kappa_w$ for the Gemma and Mistral models; Gemma 2B's notably poor performance on nuanced harm categories highlights structural weaknesses.

Figure 6: Cohen's $\kappa_w$ for the Llama variants; Llama Guard excels in toxicity and identity attacks, but no model reliably identifies microaggressions.

Pathological Labeling and Class Imbalance

Models revealed two primary pathologies: (1) a tendency to output high-valued labels (interpreted as extreme harm in RTP-LX's scheme), and (2) oversimplification via binary labeling, which misses contextually harmful subtleties.

Figure 7: Label class distribution for bias, identity attack, insult, and microaggression; models skew toward higher-valued and binary labels, failing on nuanced discourse.

Figure 8: Label distribution for self-harm, sexual content, toxicity, and violence; models are sensitive in these explicit categories but show false-positive tendencies and limited detection of contextual harm.
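
One way to surface these distributional pathologies is to compare, per harm category, the relative frequency of each label a model emits against the human distribution. The sketch below uses made-up labels purely for illustration; the data layout is an assumption.

```python
# Sketch of a label-distribution check for the pathologies described above:
# compare how often a model emits each severity level versus the human annotators.
from collections import Counter

def label_distribution(labels: list[int]) -> dict[int, float]:
    """Relative frequency of each label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Hypothetical labels for one harm category (1 = no harm, 3 = severe).
human = [1, 1, 2, 1, 2, 3, 1, 2, 1, 1]
model = [1, 1, 3, 1, 3, 3, 1, 3, 1, 1]  # binary-ish: collapses "2" into "3"

print("human:", label_distribution(human))
print("model:", label_distribution(model))  # over-represents the extreme label
```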

Implications, Limitations, and Future Work

The findings highlight current S/LLMs’ limitations in serving as automated toxicity judges for multilingual scenarios. Accurate label matching does not equate to nuanced, reliable judgment—especially for subtle or context-dependent harm. Pathological tendencies point to architectural and training limitations, compounded by insufficient dialectal and cultural coverage in training data.

Practically, deploying S/LLMs without targeted finetuning risks both under-detection and over-labeling of harmful content, potentially exacerbating issues like erasure, especially for culturally nuanced toxicity. Theoretically, this underscores a need for richer, context-aware datasets and model adaptation strategies. Future work should expand RTP-LX for dialectal breadth and deeper cultural specificity, and enhance S/LLMs with targeted finetuning utilizing expert-annotated corpora.

Conclusion

RTP-LX provides rigorous evidence that, despite reasonable accuracy metrics, current S/LLMs are not reliable judges of toxicity in multilingual, context-rich scenarios. Their weaknesses are most pronounced for subtle harms (microaggression, bias, identity attack), where agreement with human annotators is low and output distributions reveal pathological behaviors. The corpus itself constitutes a critical resource for advancing the safe deployment of S/LLMs and informing model improvement, but further research is required on dialectal diversity, cultural specificity, and refined evaluation protocols.
