Adaptive Text Watermark for Large Language Models

(2401.13927)
Published Jan 25, 2024 in cs.CL

Abstract

The advancement of LLMs has led to increasing concerns about the misuse of AI-generated text, and watermarking for LLM-generated text has emerged as a potential solution. However, it is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model. This paper proposes an adaptive watermarking strategy to address this problem. To improve text quality and maintain robustness, we adaptively add watermarking to token distributions with high entropy, measured using an auxiliary model, and keep low-entropy token distributions untouched. For the sake of security and to further minimize the watermark's impact on text quality, instead of using a fixed green/red list generated from a random secret key, which can be vulnerable to decryption and forgery, we adaptively scale up the output logits based on the semantic embedding of previously generated text, obtained using a well-designed semantic mapping model. Our experiments involving various LLMs demonstrate that our approach achieves robustness comparable to existing watermark methods. Additionally, the text generated by our method has perplexity comparable to that of un-watermarked LLMs while maintaining security even under various attacks.

Overview

  • A new watermarking strategy for LLMs enhances security and robustness and preserves high text quality, while allowing watermark detection without knowledge of the model or prompt.

  • The Adaptive Watermark Token Identification (AWTI) technique uses entropy analysis for selective watermarking to preserve text quality and robustness.

  • Semantic-based Logits Scaling Vector Extraction (SLSVE) improves watermark security by deriving logits scaling vectors from the semantics of previously generated text, which is harder to reverse-engineer than a fixed green/red list.

  • The Adaptive Watermark Temperature Scaling (AWTS) method adaptively scales output logits to minimize the impact on the quality of the watermarked text.

  • Extensive testing on multiple LLMs, including OPT-2.7B, OPT-6.7B, and GPT-J-6B, shows that the adaptive watermarking method yields perplexity comparable to non-watermarked text and outperforms existing watermarking methods under paraphrase attacks.

Introduction

A newly proposed adaptive watermarking strategy for LLMs tackles key challenges associated with AI-generated text watermarking. Current watermarking methods struggle to ensure strong security, robustness, and high-quality text while allowing watermark detection without prior knowledge of the model or prompt. The proposed strategy adaptively applies watermarking only to token distributions with high entropy and leaves low-entropy distributions untouched.
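
This entropy-gated selection can be illustrated with a small sketch. The snippet below is a simplified illustration, not the paper's implementation: the function name `should_watermark` and the threshold value are assumptions. It computes the Shannon entropy of an auxiliary model's next-token distribution and marks a position for watermarking only when that entropy is high.

```python
import torch
import torch.nn.functional as F

def should_watermark(aux_logits: torch.Tensor, entropy_threshold: float = 2.0) -> bool:
    """Return True if this position should carry a watermark.

    aux_logits: next-token logits from an auxiliary (measurement) model.
    Low-entropy (near-deterministic) positions are left untouched so the
    watermark does not force unnatural token choices.
    """
    probs = F.softmax(aux_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
    return entropy > entropy_threshold
```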

Methodology

Central to the method's robustness and text quality is Adaptive Watermark Token Identification (AWTI), which selectively applies watermarking based on entropy analysis using an auxiliary model. To bolster security, Semantic-based Logits Scaling Vector Extraction (SLSVE) replaces the traditional fixed 'green/red' list with logits scaling vectors derived from the semantics of previously generated text, making the watermark more challenging to reverse-engineer. Finally, Adaptive Watermark Temperature Scaling (AWTS) adaptively scales the output logits, reducing the impact on text quality.
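
How these three components might interact at one decoding step is sketched below. This is a hedged approximation: `semantic_encoder`, `mapping_model`, the entropy threshold, and the additive scaling rule are illustrative stand-ins rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def watermarked_logits(llm_logits, aux_logits, generated_text,
                       semantic_encoder, mapping_model,
                       entropy_threshold=2.0, delta=1.5):
    """One decoding step combining AWTI, SLSVE, and AWTS (simplified)."""
    # AWTI: measure entropy of the auxiliary model's next-token distribution.
    probs = F.softmax(aux_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    if entropy <= entropy_threshold:
        return llm_logits  # low-entropy position: leave the LLM's logits untouched

    # SLSVE: map the semantic embedding of the text generated so far to a
    # vocabulary-sized logits scaling vector.
    embedding = semantic_encoder(generated_text)   # shape (d,)
    scaling_vector = mapping_model(embedding)       # shape (vocab_size,)

    # AWTS: scale up logits in proportion to the scaling vector (shown here as
    # an additive bias of strength delta; the paper's exact rule may differ).
    return llm_logits + delta * scaling_vector
```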

Experiments

The approach was tested extensively with various LLMs, including OPT-2.7B, OPT-6.7B, and GPT-J-6B, using perplexity to measure text quality and paraphrase attacks to validate robustness empirically. The perplexity of adaptively watermarked text was comparable to that of un-watermarked counterparts while security was maintained, and the adaptive method outperformed existing methods under paraphrase attacks, showcasing the robustness of its watermarking.
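
For reference, the perplexity of a generated passage is typically computed with an independent oracle language model; a minimal sketch using the Hugging Face `transformers` API follows. The choice of `gpt2` as the oracle is an assumption for illustration, not necessarily the oracle model used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, oracle_name: str = "gpt2") -> float:
    """Perplexity of `text` under an oracle causal language model."""
    tokenizer = AutoTokenizer.from_pretrained(oracle_name)
    model = AutoModelForCausalLM.from_pretrained(oracle_name).eval()
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean token-level cross-entropy; exponentiate to get perplexity.
    return torch.exp(out.loss).item()
```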

Conclusion

The adaptive watermarking technique presented addresses three critical aspects of LLM watermarking: robustness, security, and text quality. Through entropy-based adaptive watermarking, semantic logits scaling, and temperature scaling, the method ensures strong performance even under attacks that alter watermarked text. Furthermore, the detection process leverages AWTI combined with SLSVE, making detection independent of the underlying LLM and the original prompt. This provides a holistic solution to the challenges facing LLM watermarking today.
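
A hedged sketch of how such model- and prompt-free detection could look: the candidate text is re-scored with the auxiliary model (AWTI) and the semantic mapping (SLSVE), and tokens at high-entropy positions are checked for alignment with their scaling vectors. The component names, the Hugging Face-style `aux_model(...).logits` interface, and the mean-alignment statistic are assumptions for illustration; the paper's actual detection statistic may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def detection_score(token_ids, aux_model, tokenizer,
                    semantic_encoder, mapping_model, entropy_threshold=2.0):
    """Mean alignment between observed tokens and their scaling vectors."""
    scores = []
    for t in range(1, len(token_ids)):
        prefix = token_ids[:t]
        # AWTI on the candidate text: only high-entropy positions are scored.
        aux_logits = aux_model(torch.tensor([prefix])).logits[0, -1]
        probs = F.softmax(aux_logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum()
        if entropy <= entropy_threshold:
            continue
        # SLSVE on the same prefix text, without needing the original LLM or prompt.
        scaling_vector = mapping_model(semantic_encoder(tokenizer.decode(prefix)))
        scores.append(scaling_vector[token_ids[t]].item())
    # Watermarked text should yield a noticeably higher mean score than human text.
    return sum(scores) / max(len(scores), 1)
```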
