PS-KD: PatchMix Simulation for High-Fidelity Knowledge Distillation

  • Jiazhen Xu
  • Chong Wang*
  • Sunqi Lin
  • Yuqi Xie
  • Jiangbo Qian
  • Jiafei Wu
  • Yuqi Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Knowledge Distillation (KD) is a widely used model compression technique that transfers knowledge primarily by aligning the predictions of a student model with those of a teacher model. Beyond traditional logit-based KD, combining distillation with data augmentation techniques such as MixUp is another effective way to improve distillation efficiency. However, PatchMix, despite being a powerful data augmentation method, has shown limited effectiveness in CNN-based knowledge distillation. This is likely due to constraints in the CNN teacher's receptive field and the absence of PatchMix-retrained teacher models. In this paper, we explore why PatchMix tends to be less effective than MixUp and introduce a novel framework called PatchMix Simulation Knowledge Distillation (PS-KD). The proposed framework simulates a PatchMix-retrained teacher using a vanilla one to guide the student's training, ensuring high-fidelity information distillation in the feature space. By revisiting the use of PatchMix in CNNs and reducing information distortion, our method enhances the CNN's spatial invariance and increases the fidelity of network representations. Extensive experiments demonstrate the superiority of our approach, enabling the network to identify discriminative regions in images with greater accuracy. The code will be released soon.
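For context, the sketch below illustrates the generic building blocks the abstract refers to: standard logit-based distillation (temperature-softened KL divergence) and MixUp versus patch-level mixing, written for PyTorch tensors. It is a minimal illustration under those assumptions, not the authors' released code; in particular, the PS-KD teacher-simulation step is not reproduced here, since the abstract does not specify its formulation, and the `patch_mix` helper is only a hypothetical stand-in for PatchMix-style augmentation.

```python
# Minimal, generic sketch of logit-based KD plus MixUp / patch-level mixing.
# Not the PS-KD method itself; patch_mix is an illustrative stand-in.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard logit distillation: KL divergence between temperature-softened outputs."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def mixup(images, alpha=1.0):
    """MixUp: global pixel-wise blend of each image with a randomly paired one."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, perm, lam


def patch_mix(images, patch=32, alpha=1.0):
    """Illustrative patch-level mixing: swap rectangular patches between paired
    images instead of blending them globally (hypothetical stand-in for PatchMix)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = images.clone()
    _, _, h, w = images.shape
    # In expectation, a fraction (1 - lam) of the patches is replaced.
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if torch.rand(1).item() > lam:
                mixed[:, :, y:y + patch, x:x + patch] = images[perm, :, y:y + patch, x:x + patch]
    return mixed, perm, lam
```

In a typical augmentation-based distillation step, both the frozen teacher and the student receive the same mixed batch, and `kd_loss` is combined with a cross-entropy term on the correspondingly mixed labels; how PS-KD replaces the missing PatchMix-retrained teacher within this loop is described in the paper itself.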

Original language: English
Journal: IEEE Transactions on Cognitive and Developmental Systems
DOIs
State: Accepted/In press - 2025
Externally published: Yes

Keywords

  • Knowledge Distillation
  • Model Compression
  • PatchMix
