- The paper introduces PromptBench to evaluate LLM robustness against adversarial prompts using character- to semantic-level manipulations.
- It demonstrates that word-level attacks cause an average 39% performance drop across diverse tasks like sentiment analysis and translation.
- It highlights the need for enhanced defense strategies such as adversarial training and ensemble methods to improve LLM resilience.
Evaluating Robustness of LLMs Against Adversarial Prompts: Insights from PromptBench
As LLMs advance, they are increasingly integrated into sectors ranging from academia to critical decision-making industries. This widespread adoption makes it essential to understand how robust LLMs are under adversarial conditions, particularly in prompt-based interactions. This paper presents "PromptBench," a benchmark constructed specifically to scrutinize LLM performance against adversarially manipulated prompts.
Overview
PromptBench probes the susceptibility of LLMs by generating adversarial prompts at four granularity levels: character, word, sentence, and semantic. The benchmark comprises 4,788 crafted adversarial prompts spanning tasks such as sentiment analysis, natural language inference, and machine translation, and its extensive evaluation highlights notable vulnerabilities in current LLMs.
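To make the granularity levels concrete, here is a minimal sketch of what character- and word-level perturbations of a task prompt can look like. The functions, synonym table, and example prompt are illustrative assumptions, not PromptBench's actual attack implementations (which build on tools such as TextBugger, DeepWordBug, TextFooler, and BertAttack).

```python
import random

random.seed(0)

def char_level_attack(prompt: str, n_edits: int = 2) -> str:
    # Swap two adjacent characters inside randomly chosen words, in the
    # spirit of character-level attacks like TextBugger / DeepWordBug.
    words = prompt.split()
    for _ in range(n_edits):
        i = random.randrange(len(words))
        w = words[i]
        if len(w) > 3:
            j = random.randrange(1, len(w) - 2)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def word_level_attack(prompt: str, synonyms: dict) -> str:
    # Replace words using a hand-made substitution table; real word-level
    # attacks (TextFooler, BertAttack) pick substitutes with a model instead.
    return " ".join(synonyms.get(w.lower(), w) for w in prompt.split())

clean_prompt = "Classify the sentiment of the following review as positive or negative."
print(char_level_attack(clean_prompt))
print(word_level_attack(clean_prompt, {"classify": "categorize", "review": "comment"}))
```

Sentence- and semantic-level attacks follow the same idea at a coarser grain, appending distracting sentences or rephrasing the entire instruction while preserving its meaning.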
Methodology and Findings
The authors categorize and test prompts across four types: zero-shot task-oriented, zero-shot role-oriented, few-shot task-oriented, and few-shot role-oriented. The adversarial attacks include character-level manipulations (TextBugger, DeepWordBug), word-level substitutions (BertAttack, TextFooler), sentence-level disruptions (StressTest, CheckList), and semantic-level modifications. Evaluating several widely used LLMs, including ChatGPT, GPT-4, and Flan-T5-large, the study finds a pronounced lack of robustness to these adversarial prompts: word-level attacks, for instance, cause an average 39% performance drop across all tasks, underscoring the need for resilience enhancements.
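As a rough illustration of how such degradation can be quantified, the sketch below computes a per-task relative performance drop (attacked vs. clean prompt) and averages it across tasks. The task names and accuracy numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

def performance_drop_rate(clean_score: float, attacked_score: float) -> float:
    # Relative drop in task performance when the clean prompt is replaced
    # by its adversarially perturbed counterpart (0.39 ~ a 39% drop).
    return 1.0 - attacked_score / clean_score

# Hypothetical clean/attacked scores under a word-level attack (not paper figures).
results = {
    "sentiment_accuracy": (0.92, 0.55),
    "nli_accuracy":       (0.81, 0.50),
    "translation_bleu":   (0.38, 0.24),
}

drops = {task: performance_drop_rate(c, a) for task, (c, a) in results.items()}
for task, d in drops.items():
    print(f"{task}: {d:.2f}")
print(f"average drop: {mean(drops.values()):.2f}")
```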
Implications and Future Directions
This investigation not only identifies vulnerabilities but also offers insight into how the models process perturbed prompts. By examining these weaknesses through attention visualization and transferability analysis, the research takes a step towards methods that can shield LLMs from adversarial exploitation. The transferability analysis shows that adversarial prompts carry over across models only to a limited degree, opening avenues for improving robustness through ensemble approaches and adversarial training.
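For intuition, a transferability check can be tabulated as follows: record which adversarial prompts succeed against each model, then compute the fraction of prompts crafted against one model that also fool another. The model names and success sets below are invented for illustration and do not reflect the paper's measurements.

```python
# Hypothetical ids of adversarial prompts that succeed against each model.
successes = {
    "chatgpt": {1, 2, 5, 8},
    "gpt4":    {2, 8},
    "flan_t5": {1, 2, 3, 5, 9},
}

def transfer_rate(source: str, target: str) -> float:
    # Share of prompts effective on the source model that remain effective
    # on the target model; low values mean attacks transfer poorly.
    src, tgt = successes[source], successes[target]
    return len(src & tgt) / len(src) if src else 0.0

for s in successes:
    for t in successes:
        if s != t:
            print(f"{s} -> {t}: {transfer_rate(s, t):.2f}")
```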
Moreover, the benchmark invites future research to use PromptBench to evaluate emerging LLMs and to refine adversarial-resistance strategies, including fine-tuning paradigms, semantic-preserving prompt rewriting, and robust prompt engineering methodologies.
Conclusion
PromptBench is a significant contribution that bridges a gap in LLM evaluation by focusing on prompt-based adversarial attacks. It lays the groundwork for ongoing improvements in AI robustness and underscores the importance of resilient design as adversarial challenges grow more sophisticated. As the field progresses, such benchmarks will be vital for hardening LLMs for practical, real-world applications and for ensuring their secure integration across diverse technological landscapes.