GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis

Published 15 Jul 2024 in cs.CR, cs.AI, cs.SD, and eess.AS | (2407.10471v2)

Abstract: Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of the leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.


Authors (5)

Citations (1)


Summary

The paper presents Groot, a novel approach that embeds watermarks during diffusion-model-based audio synthesis.
It integrates a fixed-parameter diffusion model with a dedicated encoder, allowing simultaneous watermark embedding and audio generation.
Experiments report an average watermark extraction accuracy of around 95% under compound attacks, alongside strong resilience to individual post-processing attacks.

The paper "GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis" addresses the challenge of identifying synthesized audio in an era where generative models like diffusion models are advancing rapidly. As these models evolve, distinguishing between synthesized and natural audio has become increasingly complex. While deepfake detection methods can help, they also inadvertently drive further enhancements in generative models.

Watermarking offers a proactive way to manage and monitor synthesized audio. This paper introduces Groot, a generative robust audio watermarking method designed specifically for diffusion models. Watermark generation and audio synthesis occur simultaneously, using parameter-fixed diffusion models equipped with a dedicated encoder.

The innovation of Groot lies in its architecture: the watermark is embedded as part of the synthesis process itself rather than added to the audio afterwards. Once embedded, the watermark can be retrieved by a lightweight decoder. This makes it possible not only to mark the audio but also to trace it back to the diffusion model that produced it, offering a tool for supervision and control.
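The encode, generate, decode round trip described above can be illustrated with a toy sketch. This is not the paper's architecture: the summary does not specify the encoder, the diffusion model, or the decoder, so everything here is a hypothetical stand-in. The "frozen generator" is a fixed random linear map (its parameters are never updated, mirroring the parameter-fixed idea), the "encoder" maps bits to a +1/-1 latent, and the "decoder" is a simple correlation test. In Groot itself the encoder and decoder would be learned networks and the generator a real diffusion model.

```python
import random

N_BITS = 16        # hypothetical watermark length
N_SAMPLES = 2048   # hypothetical length of the toy "audio" signal

def make_frozen_generator(seed=0):
    """Stand-in for a parameter-fixed generator: one fixed random carrier per bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(N_SAMPLES)] for _ in range(N_BITS)]

def encode(bits):
    """Toy encoder: map each watermark bit to a latent value of +1 or -1."""
    return [1.0 if b else -1.0 for b in bits]

def generate(latent, carriers):
    """Toy generation: the frozen carriers are mixed according to the latent."""
    return [sum(z * c[t] for z, c in zip(latent, carriers)) for t in range(N_SAMPLES)]

def decode(audio, carriers):
    """Toy lightweight decoder: correlate with each carrier and take the sign."""
    return [1 if sum(a * c for a, c in zip(audio, carrier)) > 0 else 0
            for carrier in carriers]

def add_noise(audio, sigma, seed=1):
    """Stand-in for a post-processing attack: additive Gaussian noise."""
    rng = random.Random(seed)
    return [a + rng.gauss(0.0, sigma) for a in audio]

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
carriers = make_frozen_generator()
audio = generate(encode(bits), carriers)
recovered = decode(add_noise(audio, sigma=1.0), carriers)
print(recovered == bits)
```

The point of the sketch is the division of labour: the generator is left untouched, and only the mapping into its input and the mapping back out of its output carry the watermark.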

The experimental results indicate that Groot is more robust than the leading state-of-the-art watermarking methods it is compared against. It remains resilient to individual post-processing attacks and, under compound attacks, maintains an average watermark extraction accuracy of around 95%.
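To make the accuracy figure concrete, the sketch below computes an average extraction accuracy over several attacked copies of a watermark. It assumes that accuracy means the fraction of watermark bits recovered correctly, which is a common convention but is not defined in this summary. The function names and the example bit strings are hypothetical and are not results from the paper.

```python
from statistics import mean

def bit_accuracy(embedded, extracted):
    """Fraction of positions where the extracted bit equals the embedded bit."""
    if not embedded or len(embedded) != len(extracted):
        raise ValueError("bit sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(embedded, extracted))
    return matches / len(embedded)

def average_accuracy(embedded, extractions):
    """Mean bit accuracy over several extractions, e.g. one per attack setting."""
    return mean(bit_accuracy(embedded, e) for e in extractions)

# Made-up example: one embedded watermark and three attacked extractions.
watermark = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
extractions = [
    [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],  # all 10 bits match
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],  # 9 of 10 bits match
    [0, 0, 1, 1, 0, 0, 1, 0, 1, 0],  # 8 of 10 bits match
]

print(f"average accuracy: {average_accuracy(watermark, extractions):.2%}")
```

Under this definition, an average of around 95% means that roughly one watermark bit in twenty is recovered incorrectly across the tested conditions.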

Overall, the paper proposes a paradigm for supervising synthesized audio and tracing it back to its source diffusion model, with a level of robustness that could support content regulation efforts.
