On the Learnability of Watermarks for Language Models

(2312.04469)
Published Dec 7, 2023 in cs.LG, cs.CL, and cs.CR

Abstract

Watermarking of language model outputs enables statistical detection of model-generated text, which can mitigate harms and misuses of language models. Existing watermarking strategies operate by altering the decoder of an existing language model. In this paper, we ask whether language models can directly learn to generate watermarked text, which would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, enabling watermarking for open models, where users can control the decoding procedure. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.

Overview

  • The paper investigates watermarking for language models, which enables statistical detection of model-generated text and helps mitigate harms and misuses of language models.

  • It introduces the idea of watermark distillation, where a 'student' model learns to mimic a 'teacher' model's watermarking behavior without altering the decoding process.

  • Watermarking strategies ranging from low- to high-distortion were tested to see how effectively 'student' models could learn them.

  • High-distortion watermarks were learned effectively, while low-distortion watermarks were harder to instill and required more training data.

  • The findings point to the need for more robust watermarking schemes and demonstrate susceptibility to spoofing attacks, suggesting that watermarks may not serve as a definitive indicator of authorship.

Introduction to Watermarks in Language Models

Watermarking techniques for language models enable the statistical identification of text generated by such models. This capability is particularly valuable for managing the consequences of language model use, such as mitigating the spread of misinformation or ensuring content traceability. Traditional watermarking approaches alter the model's output distribution during the decoding phase. This study explores a different strategy: whether language models can learn to generate watermarked text directly, without relying on a modified decoding process, which would benefit the deployment of open models and address concerns related to text provenance.
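
To make the decoding-time alteration concrete, the sketch below shows a minimal green-list watermark in the spirit of Kirchenbauer et al. (2023), together with its z-score detector. The function names, the seeding scheme, and the hyperparameter values (gamma, delta) are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's code) of a
# decoding-based "green list" watermark and its statistical detector,
# in the spirit of Kirchenbauer et al. (2023).
import math
import torch

def watermark_logits(logits, prev_token_id, gamma=0.25, delta=2.0):
    """Bias the logits of a pseudorandom 'green' subset of the vocabulary,
    seeded by the previous token id, before sampling the next token."""
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(int(prev_token_id))
    green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta  # green tokens become more likely
    return biased

def detect_z_score(token_ids, gamma=0.25, vocab_size=50257):
    """One-sided z-test: count how many tokens fall in their context's green
    list; unwatermarked text should hit roughly a gamma fraction by chance."""
    hits = 0
    for prev, tok in zip(token_ids[:-1], token_ids[1:]):
        gen = torch.Generator().manual_seed(int(prev))
        green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
        hits += int(tok in set(green.tolist()))
    n = len(token_ids) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A model watermarked this way applies `watermark_logits` at every decoding step, while detection needs only the seeding scheme, not the model itself.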

Learning Watermarks

The study introduces the concept of watermark distillation, in which a 'student' language model is trained to emulate the watermarking behavior of a 'teacher' model that employs a standard decoding-based watermarking strategy. This approach could lead to open models that generate watermarked text natively, without a modified decoder. However, it also raises the possibility of spoofing attacks, in which a malicious actor learns to imitate the watermark of a victim model and generates harmful watermarked text, damaging that model's reputation.
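
A rough sketch of the logits-based variant is shown below: the student is trained to match the teacher's watermarked next-token distribution under a KL-divergence objective. The `apply_watermark` hook and the model interfaces are assumptions standing in for whichever decoding-based scheme is being distilled, not the paper's training code.

```python
# Hedged sketch of logits-based watermark distillation: minimize the KL
# divergence between the student's next-token distribution and the teacher's
# *watermarked* distribution. `apply_watermark` stands in for any
# decoding-based watermarking rule (e.g., the green-list bias sketched above).
import torch
import torch.nn.functional as F

def logits_distillation_loss(student, teacher, input_ids, apply_watermark):
    """input_ids: [batch, seq_len] token ids; both models are assumed to follow
    an HF-style API where model(ids).logits has shape [batch, seq_len, vocab]."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
        wm_logits = apply_watermark(teacher_logits, input_ids)  # watermarked targets
        target_probs = F.softmax(wm_logits, dim=-1)
    student_logprobs = F.log_softmax(student(input_ids).logits, dim=-1)
    # KL(target || student), averaged over all token positions
    kl = (target_probs * (target_probs.clamp_min(1e-12).log() - student_logprobs)).sum(-1)
    return kl.mean()
```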

Experimentation and Findings

The researchers tested both logits-based and sampling-based watermark distillation across three decoding-based watermarking strategies and a range of hyperparameter settings, from low-distortion to high-distortion watermarks. They found that high-distortion watermarks were learned effectively by both techniques, whereas low-distortion watermarks were more challenging and required higher sample complexity to learn. The team also released their experimental code publicly for further research and application testing.
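
For contrast, a single sampling-based distillation step might look like the sketch below: the teacher first generates text with its watermarked decoder, and the student is then fine-tuned on that text with an ordinary language-modeling loss. The names and the training-loop structure are placeholders rather than the released code.

```python
# Hedged sketch of sampling-based watermark distillation: fine-tune the student
# with standard next-token cross-entropy on text the teacher generated under
# its decoding-based watermark.
import torch.nn.functional as F

def sampling_distillation_step(student, optimizer, watermarked_ids):
    """watermarked_ids: [batch, seq_len] token ids sampled from the teacher
    with the watermark applied at decoding time."""
    logits = student(watermarked_ids).logits            # [B, T, V]
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),    # predictions for positions 0..T-2
        watermarked_ids[:, 1:].reshape(-1),             # shifted targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```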

Implications and Future Work

The findings illuminate a path toward models capable of generating watermarked content without relying on specialized decoding algorithms, which would make watermarking viable for open models. However, the learned watermarking capability is lost when the model is further fine-tuned on normal text, underlining the need for more resilient watermarking schemes. Additionally, the study presents a proof-of-concept spoofing attack, pointing to limitations in using watermarks as a definitive indicator of a model's authorship of text, a critical consideration for the responsible deployment of watermarking in language models.
