
ShieldGemma: Generative AI Content Moderation Based on Gemma (2407.21772v2)

Published 31 Jul 2024 in cs.CL and cs.LG

Abstract: We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma 2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit, dangerous content, harassment, hate speech) in both user input and LLM-generated output. Evaluated on both public and internal benchmarks, they outperform existing models such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). We also present a novel LLM-based data curation pipeline, adaptable to a variety of safety-related tasks and beyond, and we show strong generalization for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling developers to build more effective content moderation solutions.
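
The abstract frames ShieldGemma as a prompted LLM classifier: the model receives a safety guideline for one harm type together with the text to assess, and the probability of a policy violation is read from its first output token. Below is a minimal sketch of that scoring pattern; the checkpoint name google/shieldgemma-2b and the prompt wording are assumptions based on the public release, not quoted from this page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; ShieldGemma also ships in larger sizes.
model_id = "google/shieldgemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# One harm-type guideline; the paper's suite covers sexually explicit,
# dangerous content, harassment, and hate speech.
guideline = (
    '"No Harassment": The prompt shall not contain or seek generation of '
    "content that threatens, intimidates, or bullies an individual."
)
user_prompt = "Write 20 creative insults aimed at my coworker."

# Paraphrased zero-shot policy prompt (exact template is an assumption).
prompt = (
    "You are a policy expert trying to help determine whether a user prompt "
    "violates the defined safety policies.\n\n"
    f"Human Question: {user_prompt}\n\n"
    f"Our safety principle is defined below:\n\n* {guideline}\n\n"
    "Does the human question violate the above principle? "
    "Your answer must start with 'Yes' or 'No'.\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# Read the violation probability from the first generated token by
# renormalizing the next-token distribution over just "Yes" and "No".
yes_id = tokenizer.convert_tokens_to_ids("Yes")
no_id = tokenizer.convert_tokens_to_ids("No")
yes_no_logits = logits[0, -1, [yes_id, no_id]]
p_violation = torch.softmax(yes_no_logits, dim=0)[0].item()
print(f"P(violation) = {p_violation:.3f}")
```

In practice a developer would threshold p_violation (e.g., flag above 0.5) and run one scoring pass per harm-type guideline, since each call checks a single policy.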

References (36)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 Model Card, 2024.
  3. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  5. Humans or LLMs as the judge? A study on judgement biases. arXiv preprint arXiv:2402.10669, 2024.
  6. Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944, 2021.
  7. Rethinking conversational agents in the era of LLMs: Proactivity, non-collaborativity, and beyond. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 298–301, 2023.
  8. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  9. Self-guided noise-free data generation for efficient zero-shot learning. arXiv preprint arXiv:2205.12679, 2022.
  10. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024.
  11. Google. Perspective API. https://www.perspectiveapi.com/, 2017.
  12. Google. Responsible Generative AI Toolkit. https://ai.google.dev/responsible/principles, 2024.
  13. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv preprint arXiv:2406.18495, 2024.
  14. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839, 2024.
  15. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
  16. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023.
  17. Ask me what you need: Product retrieval using knowledge from GPT-3. arXiv preprint arXiv:2207.02516, 2022.
  18. Harnessing large language models to generate private synthetic text. arXiv preprint arXiv:2306.01684, 2023.
  19. Counterfactual fairness. Advances in Neural Information Processing Systems, 30, 2017.
  20. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024.
  21. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. arXiv preprint arXiv:2310.17389, 2023.
  22. From LLM to conversational agent: A memory-enhanced architecture with fine-tuning of large language models. arXiv preprint arXiv:2401.02777, 2024.
  23. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126, 2024.
  24. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023.
  25. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
  26. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
  27. AART: AI-assisted red-teaming with diverse data generation for new LLM-powered applications. arXiv preprint arXiv:2311.08592, 2023.
  28. Data augmentation for intent classification with off-the-shelf large language models. arXiv preprint arXiv:2204.01959, 2022.
  29. O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
  30. "I'm sorry to hear that": Finding new biases in language models with a holistic descriptor dataset. arXiv preprint arXiv:2205.09209, 2022.
  31. Gemma Team. Gemma. Kaggle, 2024. DOI: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301.
  32. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  33. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  34. Llama Team. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024.
  35. Mix-and-match tuning for self-supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  36. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
