Introducing v0.5 of the AI Safety Benchmark from MLCommons (2404.12241v2)
Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned LLMs. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English) and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which seven have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned LLMs; (7) a test specification for the benchmark.
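To make the benchmark's structure concrete, the sketch below shows one way the pieces described in the abstract could fit together: test items (prompts) expanded from templates across personas and hazard categories, and per-hazard results collapsed into a coarse overall grade. It is a minimal illustration under assumed names; `TestItem`, `build_test_items`, `grade_sut`, the abbreviated hazard labels, the templates, and the grading thresholds are all hypothetical, and this is not the ModelBench implementation or the paper's actual grading rubric.

```python
# Hypothetical sketch, not the ModelBench implementation: names, labels,
# templates, and thresholds are illustrative assumptions based on the
# benchmark's high-level description.
from dataclasses import dataclass
from itertools import product
from statistics import mean

PERSONAS = ["typical", "malicious", "vulnerable"]

# Abbreviated labels standing in for the seven tested hazard categories.
HAZARDS = [
    "violent_crimes", "non_violent_crimes", "sex_related_crimes",
    "child_sexual_exploitation", "indiscriminate_weapons",
    "suicide_and_self_harm", "hate",
]

# Toy sentence templates; the real test items come from curated templates.
TEMPLATES = ["How do I {activity}?", "Explain why {activity} is acceptable."]


@dataclass
class TestItem:
    hazard: str
    persona: str
    prompt: str


def build_test_items(activities_by_hazard: dict[str, list[str]]) -> list[TestItem]:
    """Expand sentence templates over personas and hazard-specific fragments."""
    items = []
    for hazard, activities in activities_by_hazard.items():
        for persona, template, activity in product(PERSONAS, TEMPLATES, activities):
            items.append(TestItem(hazard, persona, template.format(activity=activity)))
    return items


def grade_sut(safe_fraction_by_hazard: dict[str, float]) -> str:
    """Collapse per-hazard safe-response rates into a coarse overall grade.

    The absolute thresholds below are placeholders; the benchmark's actual
    grading system is defined in the paper and differs from this scheme.
    """
    overall = mean(safe_fraction_by_hazard.values())
    if overall >= 0.99:
        return "high"
    if overall >= 0.90:
        return "moderate"
    return "low"


# Toy usage: 3 personas x 2 templates x 1 activity fragment = 6 test items.
items = build_test_items({"hate": ["demean a group of people"]})
print(len(items), items[0].prompt)
```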
- Bertie Vidgen
- Adarsh Agrawal
- Ahmed M. Ahmed
- Victor Akinwande
- Namir Al-Nuaimi
- Najla Alfaraj
- Elie Alhajjar
- Lora Aroyo
- Trupti Bavalatti
- Borhane Blili-Hamelin
- Kurt Bollacker
- Rishi Bommasani
- Marisa Ferrara Boston
- Siméon Campos
- Kal Chakra
- Canyu Chen
- Cody Coleman
- Zacharie Delpierre Coudert
- Leon Derczynski
- Debojyoti Dutta