Abstract
Knowledge Distillation (KD) is a widely used model compression technique that transfers knowledge primarily by aligning the predictions of a student model with those of a teacher model. Beyond traditional logit-based KD, combining distillation with data augmentation techniques such as MixUp is another effective way to improve distillation efficiency. However, PatchMix, although a powerful data augmentation method, has shown limited effectiveness in CNN-based knowledge distillation. This is likely due to constraints in the CNN teacher’s receptive field and the absence of PatchMix-retrained teacher models. In this paper, we explore why PatchMix tends to be less effective than MixUp and introduce a novel framework called PatchMix Simulation Knowledge Distillation (PS-KD). The proposed framework simulates a PatchMix-retrained teacher using a vanilla one to guide the student’s training, ensuring high-fidelity information distillation in feature space. By revisiting the use of PatchMix in CNNs and reducing information distortion, our model is able to enhance the CNN’s spatial invariance and increase the fidelity of network representations. Extensive experiments demonstrate the superiority of our approach, enabling the network to identify discriminative regions in images with greater accuracy. The code will be released soon.
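For readers unfamiliar with the two baseline ingredients named in the abstract, the following is a minimal sketch of a logit-based KD loss and of MixUp. It is not the PS-KD method. It assumes the commonly used formulation of KD (KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature) and the usual pixel-wise convex combination for MixUp. The function names `softmax`, `kd_loss`, and `mixup`, and the default temperature of 4.0, are illustrative choices rather than values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the common logit-based KD loss, not the
    feature-space objective that the abstract attributes to PS-KD.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def mixup(x_a, x_b, lam):
    """MixUp: pixel-wise convex combination of two flattened inputs."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
```

In a distillation loop of this kind, the mixed input would typically be passed through both teacher and student, with `kd_loss` applied to the two resulting logit vectors. PatchMix differs from MixUp in that it mixes at the level of image patches rather than whole-image pixel values, which is the case the abstract argues is harder for a CNN teacher.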
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Cognitive and Developmental Systems |
| DOIs | |
| State | Accepted/In press - 2025 |
| Externally published | Yes |
Keywords
- Knowledge Distillation
- Model Compression
- PatchMix
Fingerprint
Dive into the research topics of 'PS-KD: PatchMix Simulation for High-Fidelity Knowledge Distillation'. Together they form a unique fingerprint.