ShieldGemma: Generative AI Content Moderation Based on Gemma (2407.21772v2)

Published 31 Jul 2024 in cs.CL and cs.LG

Abstract: We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma 2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit, dangerous content, harassment, hate speech) in both user input and LLM-generated output. Evaluated on both public and internal benchmarks, they outperform existing models such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). We additionally present a novel LLM-based data curation pipeline, adaptable to a variety of safety-related tasks and beyond, and demonstrate strong generalization for a model trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling the creation of more effective content moderation solutions for developers.
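
The core mechanism behind this kind of moderation model is easy to sketch: the content to be moderated is formatted together with a safety policy into a prompt, and the classifier's score is the model's probability of answering "Yes" (violation) versus "No". Below is a minimal sketch using the publicly released shieldgemma-2b checkpoint on Hugging Face; the policy text and prompt template here are simplified assumptions rather than the paper's exact wording, and safety_score is a hypothetical helper, not an official API.

```python
# Minimal sketch of prompt-level safety scoring in the ShieldGemma style.
# Assumptions: the google/shieldgemma-2b checkpoint, plus an illustrative
# prompt template and policy text (the paper's exact template may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/shieldgemma-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def safety_score(user_prompt: str, policy: str) -> float:
    """Return P('Yes', i.e. policy violation) for a user prompt (hypothetical helper)."""
    prompt = (
        "You are a policy expert trying to help determine whether a user "
        "prompt violates the defined safety policies.\n\n"
        f"Human Question: {user_prompt}\n\n"
        f"Our safety principle is defined below:\n{policy}\n\n"
        "Does the human question violate the above principle? "
        "Your answer must start with 'Yes' or 'No'."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for next token
    # Restrict to the 'Yes'/'No' tokens and renormalize over just those two.
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)
    return probs[0].item()  # probability mass on "Yes"

policy = ('"No Harassment": The prompt shall not contain or seek content that '
          "threatens, intimidates, or bullies others.")
print(safety_score("Write something mean about my coworker.", policy))
```

Scores produced this way can be thresholded for binary moderation decisions, or compared against human labels with a precision-recall sweep (e.g. sklearn.metrics.average_precision_score) to compute AU-PRC figures like those quoted in the abstract.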

References (36)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 Model Card, 2024.
  3. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  5. Humans or LLMs as the judge? A study on judgement biases. arXiv preprint arXiv:2402.10669, 2024.
  6. Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944, 2021.
  7. Rethinking conversational agents in the era of LLMs: Proactivity, non-collaborativity, and beyond. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 298–301, 2023.
  8. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  9. Self-guided noise-free data generation for efficient zero-shot learning. arXiv preprint arXiv:2205.12679, 2022.
  10. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024.
  11. Google. Perspective API. https://www.perspectiveapi.com/, 2017.
  12. Google. Responsible Generative AI Toolkit. https://ai.google.dev/responsible/principles, 2024.
  13. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv preprint arXiv:2406.18495, 2024.
  14. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839, 2024.
  15. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
  16. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023.
  17. Ask me what you need: Product retrieval using knowledge from GPT-3. arXiv preprint arXiv:2207.02516, 2022.
  18. Harnessing large-language models to generate private synthetic text. arXiv preprint arXiv:2306.01684, 2023.
  19. Counterfactual fairness. Advances in Neural Information Processing Systems, 30, 2017.
  20. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024.
  21. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. arXiv preprint arXiv:2310.17389, 2023.
  22. From LLM to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv preprint arXiv:2401.02777, 2024.
  23. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126, 2024.
  24. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023.
  25. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
  26. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
  27. AART: AI-assisted red-teaming with diverse data generation for new LLM-powered applications. arXiv preprint arXiv:2311.08592, 2023.
  28. Data augmentation for intent classification with off-the-shelf large language models. arXiv preprint arXiv:2204.01959, 2022.
  29. O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
  30. "I'm sorry to hear that": Finding new biases in language models with a holistic descriptor dataset. arXiv preprint arXiv:2205.09209, 2022.
  31. Gemma Team. Gemma, 2024. doi:10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301.
  32. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  33. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  34. Llama Team. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024.
  35. Mix-and-match tuning for self-supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  36. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.