On the Learnability of Watermarks for Language Models

(2312.04469)
Published Dec 7, 2023 in cs.LG, cs.CL, and cs.CR

Abstract

Watermarking of language model outputs enables statistical detection of model-generated text, which can mitigate harms and misuses of language models. Existing watermarking strategies operate by altering the decoder of an existing language model. In this paper, we ask whether language models can directly learn to generate watermarked text, which would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, enabling watermarking for open models, where users can control the decoding procedure. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.

Overview

  • The paper investigates watermarking for language models, which enables statistical detection of model-generated text and helps mitigate harms and misuses of language models.

  • It introduces the idea of watermark distillation, where a 'student' model learns to mimic a 'teacher' model's watermarking behavior without altering the decoding process.

  • Watermarking strategies ranging from low- to high-distortion were tested to see how effectively 'student' models could learn them.

  • High-distortion watermarks were learned effectively, while low-distortion watermarks were harder to instill and required more training data.

  • The findings point to the need for more robust watermarking schemes and demonstrate susceptibility to spoofing attacks, suggesting that watermarks may not serve as a definitive indicator of authorship.

Introduction to Watermarks in Language Models

Watermarking techniques for language models enable the statistical identification of text generated by such models. This capability is particularly valuable for managing the consequences of language model use, such as mitigating the spread of misinformation or ensuring content traceability. Traditional watermarking approaches alter the model's output distribution during the decoding phase. This study explores a different strategy: whether language models can learn to generate watermarked text directly, without relying on a modified decoding process, which would benefit the deployment of open models and address concerns related to text provenance.
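
To make the decoding-time alteration concrete, the sketch below shows a minimal green-list watermark in the spirit of Kirchenbauer et al. (2023), together with its z-score detector. The function names, the seeding scheme, and the hyperparameter values (gamma, delta) are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's code) of a
# decoding-based "green list" watermark and its statistical detector,
# in the spirit of Kirchenbauer et al. (2023).
import math
import torch

def watermark_logits(logits, prev_token_id, gamma=0.25, delta=2.0):
    """Bias the logits of a pseudorandom 'green' subset of the vocabulary,
    seeded by the previous token id, before sampling the next token."""
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(int(prev_token_id))
    green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta  # green tokens become more likely
    return biased

def detect_z_score(token_ids, gamma=0.25, vocab_size=50257):
    """One-sided z-test: count how many tokens fall in their context's green
    list; unwatermarked text should hit roughly a gamma fraction by chance."""
    hits = 0
    for prev, tok in zip(token_ids[:-1], token_ids[1:]):
        gen = torch.Generator().manual_seed(int(prev))
        green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
        hits += int(tok in set(green.tolist()))
    n = len(token_ids) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A model watermarked this way applies `watermark_logits` at every decoding step, while detection needs only the seeding scheme, not the model itself.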

Learning Watermarks

The study introduces the concept of watermark distillation, in which a 'student' language model is trained to emulate the watermarking behavior of a 'teacher' model that employs a standard decoding-based watermarking strategy. This approach could lead to open models that generate watermarked text natively, without a modified decoder. However, it also raises the possibility of spoofing attacks, in which a malicious actor learns to imitate the watermark of a victim model and generates harmful watermarked text, damaging that model's reputation.
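
A rough sketch of the logits-based variant is shown below: the student is trained to match the teacher's watermarked next-token distribution under a KL-divergence objective. The `apply_watermark` hook and the model interfaces are assumptions standing in for whichever decoding-based scheme is being distilled, not the paper's training code.

```python
# Hedged sketch of logits-based watermark distillation: minimize the KL
# divergence between the student's next-token distribution and the teacher's
# *watermarked* distribution. `apply_watermark` stands in for any
# decoding-based watermarking rule (e.g., the green-list bias sketched above).
import torch
import torch.nn.functional as F

def logits_distillation_loss(student, teacher, input_ids, apply_watermark):
    """input_ids: [batch, seq_len] token ids; both models are assumed to follow
    an HF-style API where model(ids).logits has shape [batch, seq_len, vocab]."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
        wm_logits = apply_watermark(teacher_logits, input_ids)  # watermarked targets
        target_probs = F.softmax(wm_logits, dim=-1)
    student_logprobs = F.log_softmax(student(input_ids).logits, dim=-1)
    # KL(target || student), averaged over all token positions
    kl = (target_probs * (target_probs.clamp_min(1e-12).log() - student_logprobs)).sum(-1)
    return kl.mean()
```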

Experimentation and Findings

The researchers tested both logits-based and sampling-based watermark distillation across three decoding-based watermarking strategies and a range of hyperparameter settings, from low-distortion to high-distortion watermarks. They found that high-distortion watermarks were learned effectively by both techniques, whereas low-distortion watermarks were more challenging and required higher sample complexity to learn. The team also released their experimental code publicly for further research and application testing.
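
For contrast, a single sampling-based distillation step might look like the sketch below: the teacher first generates text with its watermarked decoder, and the student is then fine-tuned on that text with an ordinary language-modeling loss. The names and the training-loop structure are placeholders rather than the released code.

```python
# Hedged sketch of sampling-based watermark distillation: fine-tune the student
# with standard next-token cross-entropy on text the teacher generated under
# its decoding-based watermark.
import torch.nn.functional as F

def sampling_distillation_step(student, optimizer, watermarked_ids):
    """watermarked_ids: [batch, seq_len] token ids sampled from the teacher
    with the watermark applied at decoding time."""
    logits = student(watermarked_ids).logits            # [B, T, V]
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),    # predictions for positions 0..T-2
        watermarked_ids[:, 1:].reshape(-1),             # shifted targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```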

Implications and Future Work

The findings illuminate a path toward models capable of generating watermarked content without relying on specialized decoding algorithms, which would make watermarking viable for open models. However, the learned watermarking capability is lost when the model is further fine-tuned on normal text, underlining the need for more resilient watermarking schemes. Additionally, the study presents a proof-of-concept spoofing attack, pointing to limitations in using watermarks as a definitive indicator of a model's authorship of text, a critical consideration for the responsible deployment of watermarking in language models.
