On the Learnability of Watermarks for Language Models (2312.04469v3)

Published 7 Dec 2023 in cs.LG, cs.CL, and cs.CR

Abstract: Watermarking of LLM outputs enables statistical detection of model-generated text, which can mitigate harms and misuses of LLMs. Existing watermarking strategies operate by altering the decoder of an existing LLM. In this paper, we ask whether LLMs can directly learn to generate watermarked text, which would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, enabling watermarking for open models, where users can control the decoding procedure. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.


Summary

  • The paper introduces watermark distillation, in which a student model learns the watermarking behavior of a teacher model via logit-based or sampling-based distillation.
  • Experiments show that high-distortion watermarks are learned efficiently, while low-distortion watermarks require much higher sample complexity.
  • The findings highlight the risk of spoofing attacks and the difficulty of retaining the watermark after further fine-tuning on normal text.

Introduction to Watermarks in LLMs

Watermarking techniques for LLMs enable statistical identification of text generated by a model. This capability is valuable for managing the downstream effects of LLM deployment, for example limiting the spread of machine-generated misinformation and establishing content provenance. Existing approaches embed the watermark by modifying the decoding procedure of an already-trained model. This paper asks a different question: can LLMs learn to generate watermarked text directly, without any change to the decoding process? An affirmative answer would benefit open-model deployment, where users control decoding, and has implications for using watermarks to attribute text provenance.
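
To make the decoding-based baseline concrete, the sketch below shows the flavor of a green-list scheme in the spirit of Kirchenbauer et al., one common decoding-time watermark: at each step a pseudorandom subset of the vocabulary, seeded by the previous token, receives a small logit bonus. This is a minimal illustration, not the paper's code; the function and parameter names are hypothetical.

```python
import hashlib
import random

def greenlist_bias(logits, prev_token, gamma=0.25, delta=2.0):
    """Partition the vocabulary into a pseudorandom 'green list' seeded by the
    previous token and add a positive bias delta to green-token logits."""
    vocab_size = len(logits)
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    green = set(rng.sample(range(vocab_size), int(gamma * vocab_size)))
    return [x + delta if i in green else x for i, x in enumerate(logits)], green

# Toy usage: 10-token vocabulary with uniform logits.
biased, green = greenlist_bias([0.0] * 10, prev_token=7)
print(sorted(green), biased)
```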

Learning Watermarks

The paper introduces the concept of watermark distillation, whereby a 'student' LLM is trained to emulate the watermarking behavior of a 'teacher' model that applies a standard decoding-based watermark. This approach could enable open models that naturally generate watermarked text. However, it also raises the possibility of spoofing attacks, in which a malicious party imitates the watermark of a reputable model and generates harmful content under its signature, damaging the victim model's reputation.
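
A minimal sketch of the sampling-based variant, assuming a Hugging-Face-style causal LM whose forward pass returns an object with a .logits field: the teacher first generates a corpus with the decoding-based watermark applied, and the student is then fine-tuned on that corpus with ordinary next-token cross-entropy, absorbing the watermark into its weights. Names and signatures here are illustrative, not the paper's implementation.

```python
import torch.nn.functional as F

def sampling_distill_step(student, optimizer, watermarked_ids):
    """One fine-tuning step on teacher-generated *watermarked* token ids
    (shape: batch x seq_len). Plain next-token cross-entropy is enough for
    the student to pick up the watermark signal from the sampled text."""
    logits = student(watermarked_ids[:, :-1]).logits          # (B, T-1, V)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        watermarked_ids[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```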

Experimentation and Findings

The researchers tested both logit-based and sampling-based watermark distillation across three decoding-based watermarking strategies and a range of hyperparameter settings, spanning low-distortion to high-distortion watermarks. High-distortion watermarks were learned effectively by both techniques, whereas low-distortion watermarks were harder to learn and required substantially higher sample complexity. The authors also released their experimental code for further research.
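
For the logit-based variant, a sketch under the same assumptions as above: the student's next-token distribution is pushed toward the teacher's watermark-biased distribution directly, for example with a KL objective over the full vocabulary, rather than through sampled text. Here bias_fn is a hypothetical stand-in for whatever watermark rule (such as the green-list bias sketched earlier) perturbs the teacher's logits.

```python
import torch
import torch.nn.functional as F

def logit_distill_step(student, teacher, optimizer, input_ids, bias_fn):
    """One step of logit-based watermark distillation: KL divergence between
    the student's predictions and the teacher's watermark-biased predictions."""
    with torch.no_grad():
        t_logits = bias_fn(teacher(input_ids).logits)   # teacher + watermark rule
        target = F.softmax(t_logits, dim=-1)
    s_logprobs = F.log_softmax(student(input_ids).logits, dim=-1)
    loss = F.kl_div(s_logprobs, target, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```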

Implications and Future Work

The findings illuminate a path toward models that generate watermarked content without relying on a specialized decoding algorithm, which extends watermarking to settings such as open-weight release where the decoder cannot be controlled. However, the watermark is lost when the distilled model is further fine-tuned on normal, unwatermarked text, underscoring the need for more resilient schemes. The paper also presents a proof-of-concept spoofing attack, showing that a detected watermark is not definitive evidence that the watermarked model authored the text, a critical consideration for the responsible deployment of watermarking in LLMs.
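
For context on why spoofing matters: detection in green-list style schemes typically reduces to a one-proportion z-test on how many tokens land in their green lists, so any text that clears the threshold is attributed to the watermarked model, whoever actually produced it. A self-contained sketch of such a detector, reusing the seeding convention from the earlier snippet (names are illustrative):

```python
import hashlib
import math
import random

def in_green(prev_token, token, vocab_size, gamma=0.25):
    """Same pseudorandom green-list rule as in the decoding sketch above."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return token in set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def detection_z_score(token_ids, vocab_size, gamma=0.25):
    """One-proportion z-test: how far the green-token count exceeds the
    gamma * n expected under the no-watermark null hypothesis."""
    n = len(token_ids) - 1
    hits = sum(in_green(p, t, vocab_size, gamma)
               for p, t in zip(token_ids, token_ids[1:]))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

print(detection_z_score([3, 1, 4, 1, 5, 9, 2, 6], vocab_size=10))
```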
