TY - JOUR
T1 - PS-KD: PatchMix Simulation for High-Fidelity Knowledge Distillation
AU - Xu, Jiazhen
AU - Wang, Chong
AU - Lin, Sunqi
AU - Xie, Yuqi
AU - Qian, Jiangbo
AU - Wu, Jiafei
AU - Li, Yuqi
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2025
Y1 - 2025
N2 - Knowledge Distillation (KD) is a widely used model compression technique that transfers knowledge primarily by aligning the predictions of a student model with those of a teacher model. Beyond traditional logit-based KD, combining distillation with data augmentation techniques such as MixUp is another effective way to improve distillation efficiency. However, PatchMix, despite being a powerful data augmentation method, has shown limited effectiveness in CNN-based knowledge distillation, likely due to constraints in the CNN teacher’s receptive field and the absence of PatchMix-retrained teacher models. In this paper, we explore why PatchMix tends to be less effective than MixUp and introduce a novel framework called PatchMix Simulation Knowledge Distillation (PS-KD). The proposed framework simulates a PatchMix-retrained teacher using a vanilla one to guide the student’s training, ensuring high-fidelity information distillation in the feature space. By revisiting the use of PatchMix in CNNs and reducing information distortion, our method enhances the CNN’s spatial invariance and increases the fidelity of network representations. Extensive experiments demonstrate the superiority of our approach, enabling the network to identify discriminative regions in images with greater accuracy. The code will be released soon.
AB - Knowledge Distillation (KD) is a widely used model compression technique that transfers knowledge primarily by aligning the predictions of a student model with those of a teacher model. Beyond traditional logit-based KD, combining distillation with data augmentation techniques such as MixUp is another effective way to improve distillation efficiency. However, PatchMix, despite being a powerful data augmentation method, has shown limited effectiveness in CNN-based knowledge distillation, likely due to constraints in the CNN teacher’s receptive field and the absence of PatchMix-retrained teacher models. In this paper, we explore why PatchMix tends to be less effective than MixUp and introduce a novel framework called PatchMix Simulation Knowledge Distillation (PS-KD). The proposed framework simulates a PatchMix-retrained teacher using a vanilla one to guide the student’s training, ensuring high-fidelity information distillation in the feature space. By revisiting the use of PatchMix in CNNs and reducing information distortion, our method enhances the CNN’s spatial invariance and increases the fidelity of network representations. Extensive experiments demonstrate the superiority of our approach, enabling the network to identify discriminative regions in images with greater accuracy. The code will be released soon.
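N1 - For context, the "logit-based KD" the abstract refers to is conventionally the Kullback-Leibler divergence between temperature-softened teacher and student predictions. The PyTorch-style sketch below illustrates that generic loss only; the function name and temperature value are illustrative assumptions and it does not implement the PS-KD framework or its PatchMix simulation.
N1 - import torch
N1 - import torch.nn.functional as F
N1 -
N1 - def kd_loss(student_logits, teacher_logits, temperature=4.0):
N1 -     """Standard logit-based distillation loss (not the paper's PS-KD objective)."""
N1 -     # Soften both output distributions with the same temperature.
N1 -     log_p_student = F.log_softmax(student_logits / temperature, dim=1)
N1 -     p_teacher = F.softmax(teacher_logits / temperature, dim=1)
N1 -     # Scale by T^2 so gradient magnitudes stay comparable to the hard-label loss.
N1 -     return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2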
KW - Knowledge Distillation
KW - Model Compression
KW - PatchMix
UR - https://www.scopus.com/pages/publications/105025797813
U2 - 10.1109/TCDS.2025.3647220
DO - 10.1109/TCDS.2025.3647220
M3 - Article
AN - SCOPUS:105025797813
SN - 2379-8920
JO - IEEE Transactions on Cognitive and Developmental Systems
JF - IEEE Transactions on Cognitive and Developmental Systems
ER -