Abstract
Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A particularly promising property, known as watermark radioactivity, offers potential for preventing the unauthorized use of LLM outputs in downstream distillation pipelines. However, the robustness of watermarking against scrubbing attacks and its unforgeability under spoofing attacks in unauthorized knowledge distillation settings remain underexplored. Existing attack methods either assume access to model internals or fail to support both attack types simultaneously. We propose Contrastive Decoding-guided Knowledge Distillation (CDG-KD), a unified framework that enables dual-path attacks under unauthorized knowledge distillation. At the core of CDG-KD is a novel contrastive decoding mechanism with token-level constraint fusion, which integrates a learned watermark discriminator and a probability-based constraint component to selectively manipulate watermark-relevant logits. This allows fine-grained control of watermark strength during generation without compromising fluency or semantics. Our approach uses contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs, followed by dual-path distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD performs both attacks effectively while preserving the general performance of the distilled model. Our findings underscore the critical need for watermarking schemes that are both robust and unforgeable. Our code is available at https://github.com/xinykou/CDG-KD.
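To make the idea of contrastive decoding over logits concrete, the following is a minimal sketch of the generic technique only. It is not the CDG-KD implementation: the discriminator and the token-level constraint fusion described in the abstract are omitted, and the function names, the `beta` strength parameter, and the toy logit vectors are all assumptions introduced for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(primary, reference, beta):
    """Push `primary` logits away from `reference` logits.

    Generic contrastive form: primary + beta * (primary - reference).
    `beta` is a hypothetical strength parameter, not taken from the paper.
    """
    if len(primary) != len(reference):
        raise ValueError("logit vectors must share the same vocabulary size")
    return [p + beta * (p - r) for p, r in zip(primary, reference)]


# Toy 4-token vocabulary; index 2 stands in for a watermark-favoured token.
weak = [1.0, 1.0, 1.0, 1.0]    # assumed: model carrying little watermark signal
strong = [1.0, 1.0, 2.0, 1.0]  # assumed: model that has absorbed the watermark

baseline = softmax(strong)
# Amplify the watermark signal (spoofing direction in this toy setup).
amplified = softmax(contrastive_logits(strong, weak, beta=1.0))
# Suppress the watermark signal (scrubbing direction in this toy setup).
suppressed = softmax(contrastive_logits(weak, strong, beta=1.0))

print(f"baseline   p(token 2) = {baseline[2]:.3f}")
print(f"amplified  p(token 2) = {amplified[2]:.3f}")
print(f"suppressed p(token 2) = {suppressed[2]:.3f}")
```

In this toy setting, swapping which model plays the `primary` role flips the direction of the manipulation, which is one way to picture how a single contrastive mechanism could serve both attack paths.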
| Original language | English |
|---|---|
| Article number | 114295 |
| Journal | Knowledge-Based Systems |
| Volume | 329 |
| DOIs | |
| State | Published - 4 Nov 2025 |
Keywords
- Knowledge distillation
- Large language model
- Watermark
Title: Unified attacks to large language model watermarks: Spoofing and scrubbing in unauthorized knowledge distillation