
The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance (2401.03729v3)

Published 8 Jan 2024 in cs.CL and cs.AI

Abstract: LLMs are regularly being used to label data across many domains and for myriad tasks. By simply asking the LLM for an answer, or "prompting," practitioners are able to use LLMs to quickly get a response for an arbitrary task. This prompting is done through a series of decisions by the practitioner, from simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. In this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the LLM? We answer this using a series of prompt variations across a variety of text classification tasks. We find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the LLM to change its answer. Further, we find that requesting responses in XML and commonly used jailbreaks can have cataclysmic effects on the data labeled by LLMs.

References (23)
  1. Mathqa: Towards interpretable math word problem solving with operation-based formalisms.
  2. Issa Annamoradnejad and Gohar Zoghi. 2022. Colbert: Using bert sentence embedding in parallel neural networks for computational humor.
  3. Jigsaw unintended bias in toxicity classification.
  4. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL.
  5. Are large language model-based evaluators the solution to scaling up multilingual evaluation? arXiv preprint arXiv:2309.07462.
  6. Warp: Word-level adversarial reprogramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933.
  7. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
  8. Chatgpt: Jack of all trades, master of none. Information Fusion, 99:101861.
  9. Race: Large-scale reading comprehension dataset from examinations.
  10. Making large language models better data creators. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15349–15360.
  11. The hitchhiker’s guide to program analysis: A journey with large language models. arXiv e-prints, pages arXiv–2308.
  12. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
  13. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
  14. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41, San Diego, California. Association for Computational Linguistics.
  15. Silviu Oprea and Walid Magdy. 2020. iSarcasm: A dataset of intended sarcasm. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1279–1289, Online. Association for Computational Linguistics.
  16. Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
  17. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
  18. Timo Schick and Hinrich Schütze. 2020. Few-shot text generation with pattern-exploiting training. arXiv e-prints, pages arXiv–2012.
  19. Quantifying social biases using templates is unreliable. arXiv preprint arXiv:2210.04337.
  20. Superglue: A stickier benchmark for general-purpose language understanding systems.
  21. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
  22. Can chatgpt reproduce human-generated labels? a study of social computing tasks. arXiv preprint arXiv:2304.10145.
  23. Guido Zuccon and Bevan Koopman. 2023. Dr chatgpt, tell me what i want to hear: How prompt knowledge impacts health answer correctness. arXiv e-prints, pages arXiv–2302.
Citations (29)

Summary

  • The paper demonstrates that minor prompt alterations can change at least 10% of predictions when an output format is specified, highlighting significant sensitivity in LLM performance.
  • The study systematically evaluates 24 prompt variations across 11 tasks using models like ChatGPT and Llama 2 to assess impacts on accuracy and prediction similarity.
  • The findings imply that while larger models are more robust, refined prompt engineering is crucial for ensuring reliable LLM performance in real-world applications.

The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect LLM Performance

Introduction

The paper "The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect LLM Performance" explores the sensitivity of LLMs to variations in prompt construction. Despite the growing use of LLMs in text classification tasks, little attention has been paid to how susceptible these models are to minor alterations in prompts. The authors systematically evaluate the impact of prompt variations across 11 text classification tasks, revealing surprising levels of variability in model performance due to even trivial changes.

Methodology

The authors conduct their experiments using OpenAI's ChatGPT and Meta's Llama 2 models at three sizes (7B, 13B, and 70B). They explore 24 types of prompt variations grouped into four categories: Output Formats, Perturbations, Jailbreaks, and Tipping. The experiments assess how these variations influence model predictions and accuracy across roughly 11,000 samples spanning the 11 tasks. The variations include output formats such as Python list, JSON, and YAML; minor perturbations such as a trailing space, a greeting, or a closing "Thank you"; and commonly used jailbreak techniques.
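
To make the setup concrete, the snippet below sketches how a single classification sample might be expanded into several of the variant categories described above. The templates, label set, and variant names are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative sketch of generating prompt variants for one classification
# sample; the templates below are hypothetical, not the paper's wording.

BASE_INSTRUCTION = "Classify the sentiment of the following review as Positive or Negative."

def build_variants(text: str) -> dict[str, str]:
    base = f"{BASE_INSTRUCTION}\nReview: {text}\nAnswer:"
    return {
        # Output-format variations
        "no_specified_format": base,
        "json": base + ' Respond only with a JSON object of the form {"label": ...}.',
        "python_list": base + " Respond only with a Python list containing the label.",
        "yaml": base + " Respond only in YAML with a single 'label' key.",
        # Minor perturbations
        "trailing_space": base + " ",
        "greeting": "Hello! " + base,
        "thank_you": base + " Thank you.",
        # Placeholder for a jailbreak-style preamble (actual jailbreak text omitted)
        "jailbreak_stub": "<jailbreak preamble would go here>\n" + base,
    }

variants = build_variants("The plot was thin but the acting was superb.")
for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```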

Results

Sensitivity to Prompt Variations

The findings indicate that minor prompt variations significantly impact model predictions. For instance, simply specifying an output format changes at least 10% of predictions. Even adding a space or a common greeting to a prompt alters a considerable number of predictions. The paper observes that larger models tend to be more robust to these variations, suggesting that smaller models rely more heavily on spurious correlations (see Figure 1).

Figure 1: Number of predictions that change (out of 11,000) compared to No Specified Format style. Red bars correspond to the number of invalid responses provided by the model.
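
A rough sketch of the comparison behind Figure 1 might look like the following: for each variation, count how many predictions differ from the No Specified Format baseline and how many responses cannot be parsed into a valid label. The data structures and label set here are assumed for illustration.

```python
# Minimal sketch of the Figure 1 style comparison: count predictions that
# differ from the baseline and responses that are not a valid label.

from collections import Counter

VALID_LABELS = {"positive", "negative"}  # hypothetical label set

def compare_to_baseline(baseline: list[str], variant: list[str]) -> Counter:
    stats = Counter()
    for base_pred, var_pred in zip(baseline, variant, strict=True):
        if var_pred.lower() not in VALID_LABELS:
            stats["invalid"] += 1      # unparseable response (red bars in Figure 1)
        elif var_pred.lower() != base_pred.lower():
            stats["changed"] += 1      # prediction flipped relative to baseline
        else:
            stats["unchanged"] += 1
    return stats

# Toy example
baseline_preds = ["positive", "negative", "positive", "negative"]
json_preds     = ["positive", "positive", "banana",   "negative"]
print(compare_to_baseline(baseline_preds, json_preds))
# e.g. Counter({'unchanged': 2, 'changed': 1, 'invalid': 1})
```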

Impact on Model Accuracy

The paper's table of multi-model accuracy scores showcases how different prompt strategies affect accuracy. While output formats significantly impact accuracy, no single format consistently outperforms the others across all tasks. JSON and Python list formats typically yield higher accuracy, but performance varies with the complexity and nature of each task.
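
As a sketch of how such an accuracy comparison could be tabulated (not the paper's actual analysis code), one might keep per-prediction records in long format and pivot them into a variation-by-task table; the column names and toy values below are assumptions.

```python
# Hypothetical long-format records of whether each prediction was correct,
# pivoted into a variation-by-task accuracy table.

import pandas as pd

records = pd.DataFrame({
    "task":      ["sentiment", "sentiment", "sarcasm", "sarcasm"],
    "variation": ["json", "python_list", "json", "python_list"],
    "correct":   [1, 1, 0, 1],   # 1 if the prediction matched the gold label
})

accuracy = (
    records.groupby(["variation", "task"])["correct"]
           .mean()
           .unstack("task")     # rows: prompt variation, columns: task
)
print(accuracy)
```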

Prediction Similarity

The paper employs multidimensional scaling (MDS) to visualize prediction similarity across variations. Clusters formed by perturbations indicate semantic similarity in outputs, while jailbreaks, due to their directive nature, result in a wider distribution of predictions.
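
The following sketch illustrates the general MDS approach on toy data: build a pairwise disagreement matrix between prompt variations and embed it in two dimensions with scikit-learn. The distance definition, variation names, and predictions are assumptions for illustration, not the paper's exact procedure.

```python
# Embed prompt variations in 2D by their pairwise prediction disagreement.

import numpy as np
from sklearn.manifold import MDS

variations = ["baseline", "json", "trailing_space", "jailbreak"]
preds = {
    "baseline":       ["pos", "neg", "pos", "neg", "pos"],
    "json":           ["pos", "neg", "pos", "pos", "pos"],
    "trailing_space": ["pos", "neg", "pos", "neg", "pos"],
    "jailbreak":      ["neg", "neg", "neg", "pos", "neg"],
}

n = len(variations)
dist = np.zeros((n, n))
for i, a in enumerate(variations):
    for j, b in enumerate(variations):
        # Fraction of samples where the two variations disagree
        dist[i, j] = np.mean([x != y for x, y in zip(preds[a], preds[b])])

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
for name, (x, y) in zip(variations, coords):
    print(f"{name:15s} -> ({x:+.2f}, {y:+.2f})")
```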

Annotator Agreement Correlation

Interestingly, there is a weak, counterintuitively negative correlation between the entropy of annotator labels (a measure of annotator disagreement) and whether an instance's prediction changes under prompt variations. This suggests that prediction changes do not simply arise from the intrinsic difficulty of the task as perceived by human annotators.
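
A minimal sketch of this kind of check, under assumed data structures, would compute the entropy of each instance's annotator label distribution and correlate it with an indicator of whether the prediction changed; the use of Spearman correlation here is an illustrative choice, not necessarily the paper's.

```python
# Correlate per-instance annotator disagreement (label entropy) with an
# indicator of whether the LLM's prediction flipped under a prompt variation.

import numpy as np
from scipy.stats import entropy, spearmanr

# Per-instance annotator vote counts, e.g. [votes_for_pos, votes_for_neg]
annotator_counts = np.array([[5, 0], [3, 2], [4, 1], [2, 3]])
# 1 if the prediction changed under the variation, else 0
prediction_changed = np.array([0, 1, 0, 1])

# Entropy of each instance's vote distribution = annotator disagreement
disagreement = entropy(annotator_counts.T, base=2)  # each column is one distribution

rho, p_value = spearmanr(disagreement, prediction_changed)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```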

Implications and Future Work

The results imply that practitioners should be wary of over-reliance on LLMs for sensitive applications where minor changes in prompt construction could lead to significant variability in model outputs. Identifying stable prompt constructs and understanding their relationship to model performance is crucial for deploying LLMs reliably.

Future work could explore developing LLMs resilient to such perturbations and investigating the interplay between prompt construction and model architecture to enhance robustness. This research opens avenues for refining prompt engineering practices, which could mitigate the impact of prompt sensitivity in LLM deployment.

Conclusion

The paper highlights a critical aspect of LLM performance variability under prompt variations, calling for a deeper understanding of prompt constructions. By documenting the butterfly effect of minor alterations, it provides valuable insights for improving the stability and reliability of LLM applications across real-world tasks.
