Abstract

With the increasing use of large language models (LLMs) like ChatGPT, watermarking has emerged as a promising approach for tracing machine-generated content. However, research on LLM watermarking often relies on simple perplexity- or diversity-based measures to assess the quality of watermarked text, which can mask important limitations in watermarking. Here we introduce two new, easy-to-use methods for evaluating watermarking algorithms for LLMs: 1) evaluation by an LLM judger following specific guidelines; and 2) binary classification on text embeddings to distinguish between watermarked and unwatermarked text. We apply these methods to characterize the effectiveness of current watermarking techniques. Our experiments, conducted across various datasets, reveal that current watermarking methods are detectable by even simple classifiers, challenging the notion of watermarking subtlety. We also find, through the LLM judger, that watermarking degrades text quality, especially the coherence and depth of responses. Our findings underscore the trade-off between watermark robustness and text quality and highlight the importance of more informative metrics for assessing watermarking quality.

Overview

  • Study presents novel methods to evaluate the impact of watermarking on LLM outputs.

  • Watermarking in LLMs aims to add traceable, non-obtrusive markers to generated text without degrading quality.

  • An LLM judger based on GPT-3.5-Turbo reveals quality issues in watermarked text, particularly with coherence and example use.

  • MLP-based classifier reliably detects watermarked text, challenging the notion of subtlety in watermarking.

  • Research indicates a need for improved watermarking methods that don’t impair text quality in essential applications.

In the domain of natural language processing, watermarking techniques have become a significant focus of research, especially in the context of LLMs like GPT-3.5-Turbo and Llama-2. The premise behind watermarking is to embed detectable yet non-obtrusive markers in the text generated by LLMs. These markers aim to trace the origin of the text and prevent misuse, but a key challenge is to embed them without degrading text quality or making them easily detectable by third parties.

This study introduces two novel benchmarks designed to evaluate the quality degradation and robustness of watermarking algorithms. The first method uses a tailored prompt with GPT-3.5-Turbo acting as an impartial judger, which scores watermarked and unwatermarked texts on factors such as relevance, detail, clarity, coherence, originality, example use, and accuracy, and provides specific reasoning for its preferences and scores. This analysis reveals that watermarking impacts text quality, particularly coherence and the use of specific examples. The second method trains a binary classifier, a simple multi-layer perceptron (MLP) over text embeddings, to distinguish watermarked from unwatermarked text.
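As a rough sketch, a judging setup along these lines can be built with the OpenAI chat API. The rubric wording, scoring scale, and prompt below are illustrative assumptions rather than the paper's exact guidelines:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative rubric; the paper's exact judging guidelines are not reproduced here.
JUDGE_PROMPT = """You are an impartial judge. Score the answer below on each
criterion from 1 (poor) to 5 (excellent): relevance, detail, clarity, coherence,
originality, example use, accuracy. Briefly justify each score, then output a
JSON object mapping each criterion to its score.

Question:
{question}

Answer:
{answer}
"""

def judge(question: str, answer: str) -> str:
    """Ask GPT-3.5-Turbo to score a single answer against the rubric."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic scoring aids reproducibility
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return response.choices[0].message.content
```

Scoring the watermarked and unwatermarked completions of the same prompt and comparing per-criterion scores then yields the kind of head-to-head quality comparison described above.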

Through rigorous evaluation across different datasets and watermarking techniques, the authors find that current watermarking methods can be detected by even simple classification models, contradicting the ideal of subtle watermarking. Furthermore, these watermarks are observed to degrade overall text quality. While logistic regression detects watermarks with modest success, the MLP-based classifier differentiates watermarked text with higher accuracy, indicating reliable detection of watermark patterns even without knowledge of the specific techniques or secret keys employed.
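As a minimal sketch, such a detector can be fit on precomputed sentence embeddings with a small MLP; the embedding source, file names, and layer sizes below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Assumed inputs: fixed-size embeddings of watermarked and unwatermarked texts,
# e.g. from any off-the-shelf sentence-embedding model (hypothetical file names).
X_marked = np.load("watermarked_embeddings.npy")    # shape (n, d)
X_clean = np.load("unwatermarked_embeddings.npy")   # shape (m, d)

X = np.vstack([X_marked, X_clean])
y = np.concatenate([np.ones(len(X_marked)), np.zeros(len(X_clean))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# A small multi-layer perceptron; hidden-layer sizes are illustrative.
clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=200, random_state=0)
clf.fit(X_train, y_train)

# Test accuracy well above 50% means the watermark leaves a detectable
# signature in embedding space, with no access to the watermark's secret key.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Swapping MLPClassifier for LogisticRegression on the same embeddings gives the weaker baseline mentioned above.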

The research also shows that the ideal of invisible watermarking is far from being achieved. Even watermarks designed to be distortion-free were found to degrade the quality of generated text. This poses a significant concern for the future of watermarking techniques, suggesting that the detectability of such watermarks may be an intrinsic property that needs to be addressed.

The implications of this study are substantial for the development of future watermarking methodologies. As the output quality of LLMs is paramount, especially in professional or critical contexts, the balance between robust watermarking and text quality is crucial. The development of watermarking techniques that do not perceptibly alter the generated text from the original model is an area ripe for further research, with the potential to impact the broader landscape of AI and machine learning applications.
