
Unified attacks to large language model watermarks: Spoofing and scrubbing in unauthorized knowledge distillation

  • East China Normal University

Research output: Contribution to journal › Article › peer-review

Abstract

Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A particularly promising property, known as watermark radioactivity, offers potential for preventing the unauthorized use of LLM outputs in downstream distillation pipelines. However, the robustness of watermarking against scrubbing attacks and its unforgeability under spoofing attacks in unauthorized knowledge distillation settings remain underexplored. Existing attack methods either assume access to model internals or fail to support both attack types simultaneously. We propose Contrastive Decoding-guided Knowledge Distillation (CDG-KD), a unified framework that enables dual-path attacks under unauthorized knowledge distillation. At the core of CDG-KD is a novel contrastive decoding mechanism with token-level constraint fusion, which integrates a learned watermark discriminator and a probability-based constraint component to selectively manipulate watermark-relevant logits. This allows for fine-grained control of watermark strength during generation without compromising fluency or semantics. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs, followed by dual-path distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs both attacks while preserving the general performance of the distilled model. Our findings underscore the critical need for developing watermarking schemes that are robust and unforgeable. Our code is available at https://github.com/xinykou/CDG-KD.
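
The contrastive decoding idea named in the abstract can be illustrated with a toy sketch. This is not the CDG-KD implementation (see the linked repository for that). It assumes two next-token logit vectors are available, one from a model carrying a strong watermark and one from a model carrying a weak one, and it omits the learned discriminator and the token-level constraint fusion entirely. The function names, the `beta` parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(strong, weak, beta, mode):
    """Shift logits along the (strong - weak) direction.

    Assumption for illustration only: the difference between the two
    logit vectors approximates the watermark signal.
      mode="amplify": push further toward the watermark (spoofing path)
      mode="scrub":   push away from the watermark (scrubbing path)
    """
    if len(strong) != len(weak):
        raise ValueError("logit vectors must have the same length")
    diff = [s - w for s, w in zip(strong, weak)]
    if mode == "amplify":
        return [s + beta * d for s, d in zip(strong, diff)]
    if mode == "scrub":
        return [w - beta * d for w, d in zip(weak, diff)]
    raise ValueError("mode must be 'amplify' or 'scrub'")


# Toy vocabulary of 4 tokens; token 2 is the one the watermark favors.
strong = [1.0, 0.5, 2.0, 0.2]  # hypothetical strongly watermarked logits
weak = [1.0, 0.5, 1.0, 0.2]    # hypothetical weakly watermarked logits

amplified = softmax(contrastive_logits(strong, weak, beta=1.0, mode="amplify"))
scrubbed = softmax(contrastive_logits(strong, weak, beta=1.0, mode="scrub"))
```

In this toy setting the probability of the watermark-favored token rises under `amplify` and falls under `scrub`, relative to both input distributions. Under the abstract's description, texts sampled along each path would then serve as training data for the two distilled student models.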

Original language: English
Article number: 114295
Journal: Knowledge-Based Systems
Volume: 329
State: Published - 4 Nov 2025

Keywords

  • Knowledge distillation
  • Large language model
  • Watermark
