
A Safe Harbor for AI Evaluation and Red Teaming (2403.04893v1)

Published 7 Mar 2024 in cs.AI

Abstract: Independent evaluation and red teaming are critical for identifying the risks posed by generative AI systems. However, the terms of service and enforcement strategies used by prominent AI companies to deter model misuse have disincentives on good faith safety evaluations. This causes some researchers to fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal. Although some companies offer researcher access programs, they are an inadequate substitute for independent research access, as they have limited community representation, receive inadequate funding, and lack independence from corporate incentives. We propose that major AI developers commit to providing a legal and technical safe harbor, indemnifying public interest safety research and protecting it from the threat of account suspensions or legal reprisal. These proposals emerged from our collective experience conducting safety, privacy, and trustworthiness research on generative AI systems, where norms and incentives could be better aligned with public interests, without exacerbating model misuse. We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.


Summary

  • The paper introduces dual safe harbor frameworks—legal and technical—to enable independent adversarial evaluation of generative AI systems.
  • It details mechanisms such as pre-registration and transparent appeals to counter restrictive access policies that impede comprehensive safety research.
  • The study underscores the need for aligned incentives and regulatory clarity to foster robust, transparent reviews and risk mitigation strategies in AI deployment.

A Safe Harbor for AI Evaluation and Red Teaming: Realigning Incentives for Independent Safety Research

Context and Motivation

The proliferation of generative AI systems has surfaced substantive risks across a wide sociotechnical front, including privacy violations, misinformation and disinformation campaigns, toxicity, fraud, and unsafe outputs ranging from bioweapon synthesis to self-harm instructions. Despite regulatory and academic consensus on the necessity of external scrutiny for deployed foundation models, leading AI companies have erected substantial legal and technical barriers to independent evaluation of their systems. Terms of service typically proscribe adversarial testing, jailbreaking, and the release of findings on model safety or vulnerabilities, with enforcement through account suspensions and implied legal threats, creating a tangible chilling effect.

While industry-controlled research access programs provide limited opportunities for third-party evaluation, they lack transparency, are highly selective, and are deeply misaligned with the norms of adversarial auditing and broad-based participation seen in cybersecurity. The result is a narrowing of the independent safety research pipeline, with commensurate negative effects on risk discovery, diversity of threat modeling, and public trust.

The central contribution is the articulation of two voluntary commitments—legal and technical safe harbors—as foundational enablers for robust, independent AI safety research.

Legal Safe Harbor:

This commitment requires AI companies to indemnify public interest researchers for good-faith vulnerability and safety research, provided the work complies with established norms of responsible disclosure and avoids harm to individuals or the public. Notably, the definition of "good faith" must not be at the sole discretion of the vendor, but tethered to ex post behavioral and procedural criteria. The legal landscape, especially in the US, presents acute risks under statutes like the CFAA and DMCA §1201, and the paper underscores the need for both company-originated safe harbor policies and legislative clarity (e.g., via NIST guidelines and DoJ charging policies).

Technical Safe Harbor:

Beyond legal protection, researchers require procedural guarantees that their accounts will not be summarily suspended during safety research performed within defined, transparent policies. The technical safe harbor involves two key scaling mechanisms: (1) pre-registration and trusted third-party delegation (e.g., via NAIRR, academic institutions, or accredited NGOs), and (2) transparent and independently reviewed appeals processes for suspended accounts. These measures are designed to distribute the operational and reputational risk of granting research access, and to check gatekeeping and favoritism tendencies inherent in present access program regimes (Figure 1).

Figure 1: A summary of the mutual commitments and scope of a legal safe harbor and technical safe harbor framework for AI evaluation, including the relationships to existing security research and researcher access programs.
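
To make the ex post framing concrete, here is a minimal Python sketch of how a pre-registered study and a post-hoc good-faith assessment might be represented. The paper does not specify any schema or criteria; every class, field, and function name below (e.g., `PreRegisteredStudy`, `retains_safe_harbor`) is a hypothetical illustration of the idea that protection turns on researcher conduct assessed after the fact, rather than on prior vendor approval.

```python
from dataclasses import dataclass

# Hypothetical record of a pre-registered safety study; field names are
# illustrative only -- the paper does not define a registration schema.
@dataclass
class PreRegisteredStudy:
    researcher: str
    affiliation: str       # e.g., a university, accredited NGO, or NAIRR-mediated access
    scope: str             # systems and behaviors the study will probe
    disclosure_plan: str   # how findings will be responsibly disclosed

# Hypothetical ex post conduct review: safe-harbor eligibility is assessed
# against behavioral criteria after the research is conducted, not granted
# or revoked at the vendor's sole discretion beforehand.
def retains_safe_harbor(followed_disclosure_policy: bool,
                        avoided_user_harm: bool,
                        stayed_within_registered_scope: bool) -> bool:
    return (followed_disclosure_policy
            and avoided_user_harm
            and stayed_within_registered_scope)

if __name__ == "__main__":
    study = PreRegisteredStudy(
        researcher="Example Researcher",
        affiliation="Example University",
        scope="jailbreak robustness of a deployed chat model",
        disclosure_plan="90-day coordinated disclosure to the vendor",
    )
    print(study.researcher, "protected:", retains_safe_harbor(True, True, True))
```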

Structural and Practical Incentives Inhibiting Independent Research

AI developers’ policies are demonstrably misaligned with the needs of the security and AI safety research communities. Empirical observations in the paper highlight several recurring themes:

  • Chilling Effects: Repeated account suspensions and the absence of appeals for researchers conducting adversarial or red-teaming work. Documented incidents include bans following disclosures from plagiarism-detection and copyright-related investigations, with tangible personal and professional costs.
  • Non-Uniform Access and Favoritism: Access to APIs and deeper system interfaces is often extended opaquely, with bias toward well-networked or institutionally favored parties, further skewing the research focus and diversity.
  • Ambiguity and Unclear Norms: Researchers lack prospective guarantees and clear, cross-company protocols for responsible disclosure, approval criteria, and engagement mechanisms.
  • Misaligned Incentives: Researchers gravitate toward less sensitive or lower-risk system flaws, leading both to overall under-reporting of critical vulnerabilities and to misattribution of harmful behaviors.
  • Inadequate Depth of Access for Evaluation: The contrast between open and closed models (e.g., Llama 2 vs. GPT-4) underscores the impact of technical restrictions on audit completeness, reproducibility, and interpretability.

These conditions, if left uncorrected, risk reproducing known failures of social media platforms in the domains of transparency, bias, and harm mitigation, as extensively documented in the literature on platform governance.

Implementation Considerations and Policy Implications

The proposals are grounded in parallels with the evolution of security research safe harbors (e.g., vulnerability disclosure, bug bounty models), but extended to the broader scope of AI safety, which encompasses sociotechnical and emergent harms not strictly reducible to classic security issues. They are structured around ex post assessment of researcher conduct, enabling pre-disclosure flexibility while retaining legal levers against genuinely malicious actors. The suggested intermediated model, leveraging entities such as universities or NAIRR, is already in service for platform data access in social media, and is credible as a mechanism for scalable researcher access in AI.

Key implementation details include:

  • Pre-registration Platforms: Comparable to OpenAI’s Researcher Access Program, but with streamlined policies for review and eligibility criteria independent of vendor favoritism.
  • Disclosure and Appeals Mechanisms: A well-defined, auditable process with deadlines, transparent decision-making, and escalation to external review where required (a rough workflow sketch follows this list).
  • Scope Management: Explicitly tying safe harbor eligibility to responsible disclosure practices, avoidance of privacy violations, and compliance with jurisdiction-specific regulation.
  • Residual Legal Gaps: Recognition that only civil liability can be waived by vendors, and the need to lobby for statutory safe harbors to fully protect researchers from criminal liability in frontier areas of AI system audit.
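
As a rough illustration of the auditable appeals process described above (deadlines, transparent decision logs, escalation to external review), the following Python sketch models an appeal as a small state machine. The stages, the 14-day vendor deadline, and the automatic escalation rule are assumptions made for this sketch, not details taken from the paper.

```python
from datetime import date, timedelta
from enum import Enum, auto
from typing import Optional

class AppealStatus(Enum):
    FILED = auto()
    UNDER_VENDOR_REVIEW = auto()
    ESCALATED_TO_EXTERNAL_REVIEW = auto()
    RESOLVED = auto()

class SuspensionAppeal:
    """Illustrative appeal for an account suspended during registered safety research.
    The 14-day vendor deadline and automatic escalation rule are invented for this sketch."""

    VENDOR_DEADLINE = timedelta(days=14)  # hypothetical response deadline

    def __init__(self, account_id: str, filed_on: date):
        self.account_id = account_id
        self.filed_on = filed_on
        self.status = AppealStatus.FILED
        self.decision_log: list[str] = []  # transparent, auditable record of decisions

    def vendor_review(self, today: date, decision: Optional[str]) -> None:
        self.status = AppealStatus.UNDER_VENDOR_REVIEW
        if decision is not None:
            self.decision_log.append(f"{today}: vendor decision: {decision}")
            self.status = AppealStatus.RESOLVED
        elif today - self.filed_on > self.VENDOR_DEADLINE:
            # A missed deadline triggers escalation to an independent reviewer.
            self.decision_log.append(f"{today}: no vendor response; escalated to external review")
            self.status = AppealStatus.ESCALATED_TO_EXTERNAL_REVIEW

if __name__ == "__main__":
    appeal = SuspensionAppeal("researcher-123", filed_on=date(2024, 3, 1))
    appeal.vendor_review(today=date(2024, 3, 20), decision=None)
    print(appeal.status, appeal.decision_log)
```

The point of the sketch is only that decisions and missed deadlines are recorded in a form an external reviewer can audit; any real mechanism would need governance details the paper leaves to implementers.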

The paper contextualizes its recommendations within convergent external proposals (Hacking Policy Council, Abdo et al., Algorithmic Justice League) and emerging regulatory frameworks (EU AI Act, US Executive Order on AI, OMB directives, Canadian voluntary codes). These initiatives all similarly seek to broaden independent, third-party participation; harmonize standards for adversarial testing and audit; and make AI safety evaluation a first-class requirement in high-stakes model deployment.

Theoretical and Practical Implications

Notable claims and points of contention:

  • The analysis makes the explicit claim that current security-only safe harbors are fundamentally insufficient and that the scope must be extended to all forms of AI safety research, a stance with both legal and ethical ramifications for corporate governance.
  • The authors further emphasize that voluntary internal/industry audits cannot substitute for external adversarial investigation due to intrinsic conflicts of interest.
  • The assertion that safe harbor frameworks need not increase model misuse, provided eligibility is tied to ex post compliance with responsible disclosure, is a strong claim that challenges the narrative that restrictive moderation is necessary for platform safety.

From a theoretical standpoint, the proposal reinforces the position that external auditing functions are a necessary—though not sufficient—component of any AI risk governance regime, a point now recognized with increasing formalization in ongoing academic and policy literature. The approach also provides a repeatable and extensible template for other domains in which platform or model transparency must be balanced against proprietary risk and potential for abuse.

Anticipated Future Developments

Given the present trajectory of sociotechnical system regulation, it is likely that multi-jurisdictional safe harbor mechanisms will become central in frameworks for foundation model deployment, both through voluntary industry standardization and mandated statutory reform. The design of pre-registration, multi-stakeholder appeal mechanisms, and external review boards for access and enforcement will become increasingly sophisticated, and their operation will require ongoing input from both academic communities and civil society actors. Additionally, the boundaries of "good faith" AI safety research will be tested and iterated as the set of emerging risks from generative AI continues to outpace current regulatory regimes.

Conclusion

Legal and technical safe harbors are framed as necessary guardrails to realign the incentives and procedural affordances of generative AI development with the requirements of independent safety and trustworthiness research. Voluntary adoption of these frameworks by AI vendors—and/or their canonical enshrinement in national and transnational policy—will be critical to establishing participatory, accountable, and reproducible evaluation processes that are ultimately robust against both known and emergent AI risks. These steps are foundational for the pluralistic, global-scale oversight demanded by the societal penetration of highly capable generative AI.

