
Watermark Smoothing Attacks against Language Models

(2407.14206)
Published Jul 19, 2024 in cs.LG

Abstract

Watermarking is a technique used to embed a hidden signal in the probability distribution of text generated by LLMs, enabling attribution of the text to the originating model. We introduce smoothing attacks and show that existing watermarking methods are not robust against minor modifications of text. An adversary can use weaker language models to smooth out the distribution perturbations caused by watermarks without significantly compromising the quality of the generated text. The modified text resulting from the smoothing attack remains close to the distribution of text that the original model (without watermark) would have produced. Our attack reveals a fundamental limitation of a wide range of watermarking techniques.

Overview

  • The paper introduces a novel 'smoothing attack' methodology to circumvent watermarking techniques in LLMs, revealing fundamental vulnerabilities in current watermarking methods.

  • This attack is executed in two phases: watermark inference and watermark smoothing, which together allow the adversary to effectively neutralize watermark signals without degrading text quality.

  • Experimental evaluations using Llama-7B and OPT-6.7B models demonstrate that the proposed attack achieves high success rates in identifying and smoothing out watermarked tokens, while maintaining the quality of generated text.

Watermark Smoothing Attacks against Language Models

The paper "Watermark Smoothing Attacks against Language Models" by Hongyan Chang, Hamed Hassani, and Reza Shokri provides a detailed examination of the robustness of statistical watermarking techniques employed in LLMs. The authors introduce a novel attack methodology termed "smoothing attacks" aimed at circumventing existing watermarking techniques without significantly degrading the quality of the generated text. This paper's investigation reveals fundamental limitations of conventional watermarking methods, especially when confronted with adversaries that possess weaker reference models.

Background and Problem Statement

Watermarking in LLMs involves embedding subtle signals within the probability distributions of text sequences to make the text attributable to a specific model. These watermarks are designed to be undetectable by human readers while remaining identifiable by automated detection methods. The two primary challenges of watermarking are maintaining text quality and preventing easy erasure of the watermark.
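For concreteness, the family of schemes discussed here biases generation toward a pseudorandom "green list" of tokens, in the style of the Kirchenbauer et al. (KGW) watermark. Below is a minimal sketch of such a scheme; the parameter names (GAMMA, DELTA) and the choice of seeding the green list with the previous token are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of a KGW-style "green list" logit-bias watermark.
# GAMMA, DELTA, and the previous-token seeding are illustrative assumptions.
import torch

GAMMA = 0.25   # fraction of the vocabulary placed on the green list
DELTA = 2.0    # logit bias added to green tokens

def green_list(prev_token_id: int, vocab_size: int) -> torch.Tensor:
    """Pseudorandomly partition the vocabulary, seeded by the previous token."""
    gen = torch.Generator().manual_seed(prev_token_id)
    perm = torch.randperm(vocab_size, generator=gen)
    green = torch.zeros(vocab_size, dtype=torch.bool)
    green[perm[: int(GAMMA * vocab_size)]] = True
    return green

def watermarked_logits(logits: torch.Tensor, prev_token_id: int) -> torch.Tensor:
    """Shift the logits of green tokens upward before sampling."""
    green = green_list(prev_token_id, logits.shape[-1])
    return logits + DELTA * green.float()
```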

In this research, the authors focus on the second challenge, probing how easily a watermark can be erased, by introducing smoothing attacks against watermarked outputs. The adversary's goal is to recover text close to what the original unwatermarked model would have produced, using a weaker reference model to smooth out the statistical perturbations introduced by the watermark.

Attack Framework

The attack proceeds in two distinct phases: watermark inference and watermark smoothing.

Phase I: Watermark Inference

The authors assume an adversary with access to a weaker reference model ($M_{ref}$). The target model is denoted as $\tilde{M}$ (watermarked model), and the goal is to infer the "green list" (watermarked tokens). The key observation utilized is that, while different models might generate somewhat differing token distributions, the relative ranking of tokens by likelihood should generally agree. In contrast, watermarked models introduce systematic shifts favoring green tokens, facilitating their identification by rank differences.

By querying both models with many varied prefixes while holding fixed the context that determines the green list, the attack averages out ordinary model-to-model discrepancies and amplifies the watermark-induced shifts. The result is a watermark inference score that measures relative token rank shifts and separates green tokens from red ones.
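A hedged sketch of such an inference score follows, assuming it is an average per-token rank difference between the watermarked model and the reference model over many prefixes; the helper names and the exact averaging are illustrative, not the paper's precise statistic.

```python
# Phase-I sketch: average rank shift of each vocabulary item between the
# watermarked model and the weaker reference model, over many prefixes that
# share the same watermark-seeding context. Larger score => more likely green.
import torch

def rank_of_tokens(logits: torch.Tensor) -> torch.Tensor:
    """Rank of each vocabulary item (0 = most likely) under the given logits."""
    order = torch.argsort(logits, descending=True)
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(logits.shape[-1])
    return ranks

def inference_scores(wm_logits_per_prefix, ref_logits_per_prefix):
    """Average rank difference over prefixes (both args: lists of 1-D logit tensors)."""
    scores = torch.zeros_like(wm_logits_per_prefix[0])
    for wm_logits, ref_logits in zip(wm_logits_per_prefix, ref_logits_per_prefix):
        # Positive when the watermarked model ranks a token higher than the reference does.
        scores += (rank_of_tokens(ref_logits) - rank_of_tokens(wm_logits)).float()
    return scores / len(wm_logits_per_prefix)
```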

Phase II: Watermark Smoothing

Once the green list is inferred with high confidence, the smoothing phase generates logits as a weighted average: for tokens suspected of carrying the watermark bias, the output distribution is adjusted by interpolating the watermarked model's logits with the reference model's logits. This neutralizes the watermark's perturbations while preserving high utility in the generated text.
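A minimal sketch of this smoothing step under those assumptions is shown below; the mixing weight `alpha` and the use of a hard green/red mask are illustrative choices rather than the paper's exact formulation.

```python
# Phase-II sketch: mix in reference-model logits where a watermark bias is
# suspected, and keep the (stronger) watermarked model's logits elsewhere.
import torch

def smoothed_logits(wm_logits: torch.Tensor,
                    ref_logits: torch.Tensor,
                    inferred_green: torch.Tensor,   # bool mask from Phase I
                    alpha: float = 0.5) -> torch.Tensor:
    """Neutralize the suspected watermark bias by interpolating logits."""
    mixed = alpha * wm_logits + (1.0 - alpha) * ref_logits
    return torch.where(inferred_green, mixed, wm_logits)
```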

Experimental Evaluation

The authors validate their approach using Llama-7B and OPT-6.7B models on LFQA and OpenGen datasets. Key metrics include perplexity for text quality and the z-score for watermark detection strength. Comparisons are made against established paraphrasing attacks and simpler average-based attacks.
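For reference, the standard detector for green-list watermarks computes a z-score from the number of green tokens in a passage relative to the chance rate gamma; the sketch below follows the usual KGW-style statistic, and the paper's detector may differ in detail.

```python
# z-score detection sketch: compare the observed green-token count against the
# count expected by chance (gamma * T). Large z => text flagged as watermarked.
import math

def detection_z_score(num_green: int, num_tokens: int, gamma: float = 0.25) -> float:
    """z = (num_green - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std
```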

Results

  • Watermark Inference: The inference score, robustly averaged over multiple prefixes, achieves high AUC values (>0.9), indicating strong performance in identifying green tokens.
  • Watermark Smoothing: The generated text post-attack maintains high quality, comparable to unwatermarked text, with significantly reduced detectability as evidenced by lower z-scores and positive prediction rates close to 0%. Notably, adversarial samples evade detection while avoiding the pitfalls of degraded text quality seen in naive paraphrasing.

Implications and Future Work

The findings underscore substantive vulnerabilities in current watermarking schemes. With rapid advancements in LLMs, such attacks highlight the need for more resilient watermarking strategies. Future developments may include:

  1. Enhanced Robustness: Designing watermarking methods resistant to smoothing attacks.
  2. Efficiency Improvements: Reducing query requirements for effective attacks, making them feasible on larger scales.
  3. Regulatory Measures: Implementing standard practices to ensure responsible AI deployment and usage.

This paper contributes significantly to the ongoing discourse on AI security, advocating for dynamic and resilient approaches to watermarking in the context of evolving adversarial threats. The techniques and insights provided form a robust foundation for future research addressing these critical challenges.
