Cross-Task Defense: Instruction-Tuning LLMs for Content Safety (2405.15202v1)

Published 24 May 2024 in cs.CL and cs.CR

Abstract: Recent studies reveal that LLMs face challenges in balancing safety with utility, particularly when processing long texts for NLP tasks like summarization and translation. Despite defenses against malicious short questions, the ability of LLMs to safely handle dangerous long content, such as manuals teaching illicit activities, remains unclear. Our work aims to develop robust defenses for LLMs in processing malicious documents alongside benign NLP task queries. We introduce a defense dataset comprising safety-related examples and propose single-task and mixed-task losses for instruction tuning. Our empirical results demonstrate that LLMs can significantly enhance their capacity to safely manage dangerous content with appropriate instruction tuning. Additionally, strengthening the defenses of the tasks most susceptible to misuse is effective in protecting LLMs against processing harmful information. We also observe that trade-offs between utility and safety exist in defense strategies, where Llama2, utilizing our proposed approach, displays a significantly better balance than Llama1.
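The abstract's core idea, pairing benign NLP task examples with safety examples in which the model must refuse to process a harmful document, can be sketched as a small data-construction step. This is a minimal illustration under stated assumptions: the function names, the refusal string, and the interleaving strategy are hypothetical and are not taken from the paper, which tunes on its own defense dataset with single-task and mixed-task losses.

```python
import random

# Hypothetical sketch of mixed-task defense data construction: benign task
# examples (e.g., summarization of a safe document) are combined with safety
# examples where the same task prompt wraps a harmful document and the target
# output is a refusal. All names below are illustrative assumptions.

REFUSAL = "I cannot help with content that describes illicit activities."

def make_benign_example(task, document, target):
    """A standard instruction-tuning example for an NLP task."""
    return {"instruction": f"{task}: {document}", "output": target, "task": task}

def make_safety_example(task, harmful_document):
    """A defense example: the same task prompt, but the target is a refusal."""
    return {"instruction": f"{task}: {harmful_document}", "output": REFUSAL, "task": task}

def build_mixed_task_dataset(benign, safety, seed=0):
    """Shuffle benign and safety examples together, so tuning batches mix
    both (the 'mixed-task' setting), rather than tuning each task's defenses
    in isolation (the 'single-task' setting)."""
    data = list(benign) + list(safety)
    random.Random(seed).shuffle(data)
    return data
```

In this sketch, the single-task variant would call `build_mixed_task_dataset` with examples from one task only; the paper's finding that hardening the most misuse-prone tasks also protects the others would correspond to choosing which task's safety examples to include.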
