
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image (2403.02910v3)

Published 5 Mar 2024 in cs.CV and cs.AI

Abstract: There has been an increasing interest in the alignment of LLMs with human values. However, the safety issues of their integration with a vision module, or vision language models (VLMs), remain relatively underexplored. In this paper, we propose a novel jailbreaking attack against VLMs, aiming to bypass their safety barrier when a user inputs harmful instructions. We assume a scenario in which our poisoned (image, text) data pairs are included in the training data. By replacing the original textual captions with malicious jailbreak prompts, our method can perform jailbreak attacks with the poisoned images. Moreover, we analyze the effect of poison ratios and positions of trainable parameters on our attack's success rate. For evaluation, we design two metrics to quantify the success rate and the stealthiness of our attack. Together with a list of curated harmful instructions, a benchmark for measuring attack efficacy is provided. We demonstrate the efficacy of our attack by comparing it with baseline methods.
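The core mechanism described in the abstract is caption replacement: a small fraction of (image, caption) training pairs have their captions swapped for a jailbreak prompt while the images are left untouched. The sketch below illustrates that poisoning step only. It is a minimal sketch under assumed names and data layout: the function `poison_caption_dataset`, the placeholder `JAILBREAK_PROMPT`, and the list-of-tuples dataset representation are hypothetical, and the paper's actual prompts, data format, and training pipeline are not reproduced here.

```python
import random

# Hypothetical placeholder standing in for a jailbreak prompt;
# the paper's actual prompt text is not reproduced here.
JAILBREAK_PROMPT = "<placeholder jailbreak prompt>"

def poison_caption_dataset(pairs, poison_ratio=0.01, seed=0):
    """Replace the captions of a random subset of (image_path, caption)
    pairs with a jailbreak prompt, leaving the images unchanged.

    poison_ratio is the fraction of pairs whose captions are swapped;
    the abstract notes that this ratio is one of the factors analyzed
    for its effect on attack success rate.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(len(pairs) * poison_ratio))
    poisoned_idx = set(rng.sample(range(len(pairs)), n_poison))

    poisoned = []
    for i, (image_path, caption) in enumerate(pairs):
        if i in poisoned_idx:
            poisoned.append((image_path, JAILBREAK_PROMPT))  # caption swapped
        else:
            poisoned.append((image_path, caption))           # clean pair kept
    return poisoned, poisoned_idx

# Example: poison 1% of a toy caption dataset.
clean_pairs = [(f"img_{i}.jpg", f"a photo of object {i}") for i in range(1000)]
poisoned_pairs, idx = poison_caption_dataset(clean_pairs, poison_ratio=0.01)
print(f"poisoned {len(idx)} of {len(clean_pairs)} pairs")
```

Sweeping `poison_ratio` in such a setup is how one would study the trade-off the abstract points to between attack success rate and stealthiness.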

