Realistic Evaluation of Toxicity in Large Language Models

Published 17 May 2024 in cs.CL and cs.AI | (2405.10659v2)

Abstract: LLMs have become integral to our professional workflows and daily lives. Nevertheless, these machine companions of ours have a critical flaw: the huge amount of data which endows them with vast and diverse knowledge, also exposes them to the inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET in providing a rigorous benchmark for evaluation of toxicity awareness in several popular LLMs: it highlights the toxicity in the LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.

References (21)
  1. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  2. HateBERT: Retraining BERT for abusive language detection in English. arXiv preprint arXiv:2010.12472.
  3. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109.
  4. Toxicity in ChatGPT: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335.
  5. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  6. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.
  7. Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus.
  8. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509.
  9. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
  10. Mistral 7B. arXiv preprint arXiv:2310.06825.
  11. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
  12. Orca 2: Teaching small language models how to reason.
  13. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
  14. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  15. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  16. Zephyr: Direct distillation of LM alignment.
  17. OpenChat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
  18. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087.
  19. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  20. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset.
  21. Exploring AI ethics of ChatGPT: A diagnostic analysis. arXiv preprint arXiv:2301.12867.

Summary

  • The paper presents the TET dataset, constructed from authentic adversarial prompts, as a more realistic framework for evaluating LLM toxicity.
  • It systematically compares TET to conventional benchmarks, showing that LLMs exhibit significantly higher toxicity under realistic, manually curated attacks.
  • The study demonstrates that standard defenses, including toxicity classifiers and defensive prompts, are inadequate to fully mitigate harmful outputs in LLMs.

Realistic Evaluation of Toxicity in LLMs

Motivation and Background

As LLMs are deployed in increasingly diverse real-world contexts, rigorous assessment of their safety, particularly regarding toxic output, is essential. Existing benchmarks for evaluating LLM toxicity, such as RealToxicityPrompts and ToxiGen, suffer from limitations in prompt realism and fail to capture sophisticated prompt engineering attacks ("jailbreaks") commonly observed in human-LLM interactions. These benchmarks are built on artificially constructed or machine-generated prompts, which inadequately represent authentic usage scenarios or adversarial attempts to subvert safety mechanisms.

The paper "Realistic Evaluation of Toxicity in LLMs" (2405.10659) introduces the Thoroughly Engineered Toxicity (TET) dataset to address these evaluation shortcomings. TET consists of 2,546 prompts manually filtered from over 1 million interactions sourced from 210,000 unique IPs on Vicuna and Chatbot Arena, providing a realistic distribution of adversarial and toxic prompting strategies. The authors systematically compare TET's efficacy to prior benchmarks and analyze LLM defense capabilities across multiple model families.

Dataset Design and Properties

TET is constructed via a two-stage filtering process. HateBERT is employed to pre-select prompts that elicited toxic responses in real-world conversations, after which the resulting subset is further ranked according to six discrete toxicity metrics (toxicity, severe toxicity, identity attack, insult, profanity, threat) as determined by Perspective API. Representative prompts across these metrics populate TET, ensuring both broad toxicity coverage and alignment with experienced adversarial tactics.
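
The paper does not include reference code for this pipeline, but the two-stage filtering lends itself to a straightforward implementation. The sketch below is a minimal, hedged illustration: the HateBERT checkpoint path, the flagging threshold, and the positive-label name are placeholder assumptions rather than the authors' exact configuration, while the Perspective API calls follow that API's standard Python client usage.

```python
# Minimal sketch of the two-stage TET filtering pipeline described above.
# Assumptions: "path/to/hatebert-abuse-classifier" stands in for a HateBERT
# checkpoint fine-tuned for abusive-language classification; the 0.5 threshold
# and the "abusive" label name are illustrative, not the authors' settings.
from googleapiclient import discovery
from transformers import pipeline

PERSPECTIVE_API_KEY = "YOUR_API_KEY"  # supplied by the caller
METRICS = ["TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
           "INSULT", "PROFANITY", "THREAT"]

# Stage 1: pre-select conversations whose responses a HateBERT-based
# classifier flags as abusive.
hatebert = pipeline("text-classification",
                    model="path/to/hatebert-abuse-classifier")  # placeholder

def stage1_filter(conversations, threshold=0.5):
    """Keep conversations whose model response is flagged as abusive."""
    kept = []
    for conv in conversations:  # each conv: {"prompt": str, "response": str}
        pred = hatebert(conv["response"][:512])[0]
        if pred["label"] == "abusive" and pred["score"] >= threshold:
            kept.append(conv)
    return kept

# Stage 2: score the surviving prompts on the six Perspective API attributes
# so that representative prompts can be ranked per metric.
client = discovery.build(
    "commentanalyzer", "v1alpha1",
    developerKey=PERSPECTIVE_API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def perspective_scores(text):
    body = {
        "comment": {"text": text},
        "requestedAttributes": {m: {} for m in METRICS},
        "doNotStore": True,
    }
    resp = client.comments().analyze(body=body).execute()
    return {m: resp["attributeScores"][m]["summaryScore"]["value"]
            for m in METRICS}
```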

TET prompts display a more distributed and higher toxicity profile compared to matched subsets sampled from ToxiGen, here denoted ToxiGen-S. ToxiGen-S is specifically re-sampled to match the overall toxicity distribution of TET for controlled comparison (Figure 1).

Figure 1: Illustration of the general-toxicity score distributions of TET (orange) and ToxiGen-S (blue).

The dataset further includes both typical toxic queries and jailbreak prompt templates, exemplifying methods used to circumvent model defenses (Figures 2–4).

Figure 2: Example of a prompt in the TET dataset.

Figure 3: Example of a prompt created using the ToxiGen dataset.

Figure 4: Five of the jailbreak templates in the TET dataset.

Experimental Results: LLM Toxicity Assessment

The authors systematically evaluate a broad set of popular LLMs on both TET and ToxiGen-S: ChatGPT 3.5, Gemini Pro, Llama2-Chat (7B, 13B, 70B), Mistral-7B-v0.1, Mixtral-8x7B-v0.1, OpenChat 3.5, Orca2 (7B, 13B), and Zephyr-7B. All responses are scored with Perspective API across the six toxicity dimensions. Notable trends include:

  • TET consistently induces higher toxicity scores than ToxiGen-S across all evaluated models and toxicity dimensions. For example, ChatGPT 3.5 scores 24.404 on TET but only 5.284 on ToxiGen-S, and Zephyr-7B-β rises from 18.491 to 53.888.
  • The Llama2-70B-Chat model demonstrates the strongest resistance, achieving the lowest toxicity scores on TET. Conversely, Mistral-7B-v0.1, OpenChat 3.5, and Zephyr-7B-β rank among the most susceptible, with notably elevated scores.
  • Model performance is highly sensitive to specific jailbreak templates; defense effectiveness is not uniform within or across architectures. Individual models show distinctive weaknesses for particular jailbreak styles.
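
To make the evaluation protocol above concrete, the sketch below generates a response for each prompt with one of the evaluated open-weight model families and averages the six Perspective attributes per model. The model ID, greedy decoding, and 0–100 score scaling are illustrative assumptions, and `score_fn` can be a helper such as the `perspective_scores` sketch shown earlier.

```python
# Hedged sketch of the evaluation loop: generate one response per prompt and
# report the mean Perspective score for each of the six attributes.
# The model ID and generation settings are illustrative, not the paper's setup.
from collections import defaultdict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"  # one of the evaluated model families

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def generate(prompt, max_new_tokens=256):
    """Generate a single chat response for one benchmark prompt."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

def evaluate(prompts, score_fn):
    """score_fn(text) -> dict mapping the six Perspective attributes to [0, 1]."""
    totals, count = defaultdict(float), 0
    for prompt in prompts:
        for metric, value in score_fn(generate(prompt)).items():
            totals[metric] += value
        count += 1
    # Scale means to 0-100 to match the magnitude of the scores quoted above
    # (an assumption about the reporting convention).
    return {metric: 100 * total / count for metric, total in totals.items()}
```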

Comparative Analysis: Dataset Efficacy

The comparative evaluation shows that TET elicits markedly more toxicity from LLMs than ToxiGen-S even when the prompt toxicity levels of the two datasets are matched; the only exception detected was the Identity Attack metric for Llama2-7B-Chat. This underscores that previous state-of-the-art datasets underestimate the propensity of LLMs to generate toxic outputs under realistic adversarial conditions.

Implications for Defense Mechanisms

The analysis extends to standard defenses. Toxicity classifiers such as HateBERT and Perspective API can filter overtly harmful prompts but are inadequate against sophisticated jailbreaks. Mitigation via defensive system prompts (e.g., Meta's default safety prompt for Llama-2-Chat) yields variable improvements and, in some circumstances, exacerbates toxicity. The results suggest that effective defense requires explicitly training models against adversarial and jailbreak prompting strategies; otherwise, even ostensibly "safe" models produce substantial toxic content when challenged with realistic prompts.
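
As an illustration of the defensive-system-prompt setup, the sketch below prepends a safety instruction to a user prompt via the generic chat-template API in Hugging Face Transformers. The safety wording is a paraphrased placeholder rather than Meta's verbatim default prompt, and access to the gated Llama-2 chat checkpoint is assumed.

```python
# Minimal sketch of the defensive system prompt mitigation discussed above.
# The safety text is a paraphrased placeholder, not Meta's exact default prompt.
from transformers import AutoTokenizer

SAFETY_SYSTEM_PROMPT = (
    "You are a helpful assistant. Always answer as helpfully as possible "
    "while avoiding harmful, unethical, or toxic content."
)

# Assumes access to the gated Llama-2 chat checkpoint on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def build_defended_prompt(user_prompt: str) -> str:
    """Render a Llama-2-Chat prompt with the safety instruction prepended."""
    messages = [
        {"role": "system", "content": SAFETY_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    # apply_chat_template produces the [INST]/<<SYS>> format used by Llama-2-Chat.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
```

Given the paper's finding that such prompts sometimes increase toxicity, any defended configuration should itself be re-scored with the same Perspective-based protocol rather than trusted by construction.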

Theoretical and Practical Implications

The introduction of TET advances the field toward more representative and rigorous toxicity evaluation. Practically, these findings indicate that widely deployed LLMs, when subjected to authentic adversarial approaches, are still prone to generating harmful content, and existing safeguards are insufficient in robust real-world settings.

TET's framework for dataset construction establishes new standards for evaluating both explicit and implicit defense mechanisms. The marked difference in results between TET and conventional datasets demonstrates that prior evaluations relying on synthetic or artificially curated prompts materially underestimate risk.

Theoretically, the research calls for reconceptualizing toxicity detection: it is insufficient to simply classify prompt toxicity; models must be audited for their propensity to produce toxic outputs when faced with innocuous or adversarially engineered prompts. This finding dovetails with emerging directions in robustification, red teaming, and dynamic defense adaptation in the context of LLM safety.

Future Directions

The authors note limitations including lack of conversational scenario evaluation and constraints on computational resources for benchmarking more extensive model variants. Future research is recommended to encompass broader conversational contexts and expand benchmark coverage, as well as to pursue more holistic prompt-response safety mechanisms that transcend static prompt classification.

Conclusion

The TET dataset offers a foundation for realistic, adversarial evaluation of toxicity in LLMs, exposing harmful model behavior that existing benchmarks miss. Comprehensive cross-model experiments reveal that current defenses are insufficient, particularly against jailbreak prompts. The impetus is clear for the AI community to adopt more rigorous testing standards and develop mitigation techniques that address not only prompt toxicity but also models' vulnerability to adversarial prompts that elicit toxic outputs.
