Adaptive Text Watermark for Large Language Models

(2401.13927)
Published Jan 25, 2024 in cs.CL

Abstract

The advancement of LLMs has led to increasing concerns about the misuse of AI-generated text, and watermarking for LLM-generated text has emerged as a potential solution. However, it is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model. This paper proposes an adaptive watermarking strategy to address this problem. To improve text quality and maintain robustness, we adaptively add watermarking to token distributions with high entropy, measured using an auxiliary model, and keep low-entropy token distributions untouched. For the sake of security and to further minimize the watermark's impact on text quality, instead of using a fixed green/red list generated from a random secret key, which can be vulnerable to decryption and forgery, we adaptively scale up the output logits based on the semantic embedding of previously generated text, obtained using a well-designed semantic mapping model. Our experiments involving various LLMs demonstrate that our approach achieves robustness comparable to existing watermark methods. Additionally, the text generated by our method has perplexity comparable to that of un-watermarked LLMs while maintaining security even under various attacks.

Overview

  • A new watermarking strategy for LLMs enhances security and robustness and preserves high text quality, while allowing watermark detection without knowledge of the model or prompt.

  • The Adaptive Watermark Token Identification (AWTI) technique uses entropy analysis for selective watermarking to preserve text quality and robustness.

  • Semantic-based Logits Scaling Vector Extraction (SLSVE) improves watermark security by deriving logits scaling vectors from the semantics of previously generated text, which is harder to reverse-engineer than a fixed green/red list.

  • The Adaptive Watermark Temperature Scaling (AWTS) method adaptively scales output logits to minimize the impact on the quality of the watermarked text.

  • Extensive testing on multiple LLMs, including OPT-2.7B, OPT-6.7B, and GPT-J-6B, shows that the adaptive watermarking method yields perplexity comparable to non-watermarked text and outperforms existing watermarking methods under paraphrase attacks.

Introduction

A newly proposed adaptive watermarking strategy for LLMs tackles key challenges associated with AI-generated text watermarking. Current watermarking methods struggle to ensure strong security, robustness, and high-quality text while allowing watermark detection without prior knowledge of the model or prompt. The proposed strategy adaptively applies watermarking only to token distributions with high entropy and leaves low-entropy distributions untouched.
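
This entropy-gated selection can be illustrated with a small sketch. The snippet below is a simplified illustration, not the paper's implementation: the function name `should_watermark` and the threshold value are assumptions. It computes the Shannon entropy of an auxiliary model's next-token distribution and marks a position for watermarking only when that entropy is high.

```python
import torch
import torch.nn.functional as F

def should_watermark(aux_logits: torch.Tensor, entropy_threshold: float = 2.0) -> bool:
    """Return True if this position should carry a watermark.

    aux_logits: next-token logits from an auxiliary (measurement) model.
    Low-entropy (near-deterministic) positions are left untouched so the
    watermark does not force unnatural token choices.
    """
    probs = F.softmax(aux_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
    return entropy > entropy_threshold
```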

Methodology

Central to the method's robustness and text quality is Adaptive Watermark Token Identification (AWTI), which selectively applies watermarking based on entropy analysis using an auxiliary model. To bolster security, Semantic-based Logits Scaling Vector Extraction (SLSVE) replaces the traditional fixed 'green/red' list with logits scaling vectors derived from the semantics of previously generated text, making the watermark more challenging to reverse-engineer. Finally, Adaptive Watermark Temperature Scaling (AWTS) adaptively scales the output logits, reducing the impact on text quality.
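
How these three components might interact at one decoding step is sketched below. This is a hedged approximation: `semantic_encoder`, `mapping_model`, the entropy threshold, and the additive scaling rule are illustrative stand-ins rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def watermarked_logits(llm_logits, aux_logits, generated_text,
                       semantic_encoder, mapping_model,
                       entropy_threshold=2.0, delta=1.5):
    """One decoding step combining AWTI, SLSVE, and AWTS (simplified)."""
    # AWTI: measure entropy of the auxiliary model's next-token distribution.
    probs = F.softmax(aux_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    if entropy <= entropy_threshold:
        return llm_logits  # low-entropy position: leave the LLM's logits untouched

    # SLSVE: map the semantic embedding of the text generated so far to a
    # vocabulary-sized logits scaling vector.
    embedding = semantic_encoder(generated_text)   # shape (d,)
    scaling_vector = mapping_model(embedding)       # shape (vocab_size,)

    # AWTS: scale up logits in proportion to the scaling vector (shown here as
    # an additive bias of strength delta; the paper's exact rule may differ).
    return llm_logits + delta * scaling_vector
```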

Experiments

The approach was tested extensively with various LLMs, including OPT-2.7B, OPT-6.7B, and GPT-J-6B, using perplexity to measure text quality and paraphrase attacks to validate robustness empirically. The perplexity of adaptively watermarked text was comparable to that of un-watermarked counterparts while security was maintained, and the adaptive method outperformed existing methods under paraphrase attacks, showcasing the robustness of its watermarking.
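
For reference, the perplexity of a generated passage is typically computed with an independent oracle language model; a minimal sketch using the Hugging Face `transformers` API follows. The choice of `gpt2` as the oracle is an assumption for illustration, not necessarily the oracle model used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, oracle_name: str = "gpt2") -> float:
    """Perplexity of `text` under an oracle causal language model."""
    tokenizer = AutoTokenizer.from_pretrained(oracle_name)
    model = AutoModelForCausalLM.from_pretrained(oracle_name).eval()
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean token-level cross-entropy; exponentiate to get perplexity.
    return torch.exp(out.loss).item()
```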

Conclusion

The adaptive watermarking technique presented addresses three critical aspects of LLM watermarking: robustness, security, and text quality. Through entropy-based adaptive watermarking, semantic logits scaling, and temperature scaling, the method ensures strong performance even under attacks that alter watermarked text. Furthermore, the detection process leverages AWTI combined with SLSVE, making detection independent of the underlying LLM and the original prompt. This provides a holistic solution to the challenges facing LLM watermarking today.
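
A hedged sketch of how such model- and prompt-free detection could look: the candidate text is re-scored with the auxiliary model (AWTI) and the semantic mapping (SLSVE), and tokens at high-entropy positions are checked for alignment with their scaling vectors. The component names, the Hugging Face-style `aux_model(...).logits` interface, and the mean-alignment statistic are assumptions for illustration; the paper's actual detection statistic may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def detection_score(token_ids, aux_model, tokenizer,
                    semantic_encoder, mapping_model, entropy_threshold=2.0):
    """Mean alignment between observed tokens and their scaling vectors."""
    scores = []
    for t in range(1, len(token_ids)):
        prefix = token_ids[:t]
        # AWTI on the candidate text: only high-entropy positions are scored.
        aux_logits = aux_model(torch.tensor([prefix])).logits[0, -1]
        probs = F.softmax(aux_logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum()
        if entropy <= entropy_threshold:
            continue
        # SLSVE on the same prefix text, without needing the original LLM or prompt.
        scaling_vector = mapping_model(semantic_encoder(tokenizer.decode(prefix)))
        scores.append(scaling_vector[token_ids[t]].item())
    # Watermarked text should yield a noticeably higher mean score than human text.
    return sum(scores) / max(len(scores), 1)
```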
