TY - GEN
T1 - Knowledge distillation with a precise teacher and prediction with abstention
AU - Xu, Yi
AU - Pu, Jian
AU - Zhao, Hui
N1 - Publisher Copyright:
© 2020 IEEE
PY - 2020
Y1 - 2020
N2 - Knowledge distillation, which aims to train a student model under the supervision of a larger teacher model, has achieved remarkable results in supervised learning. However, existing knowledge distillation methods suffer from two major problems: the teacher's supervision is sometimes misleading, and the student's predictions are not accurate enough. To address the first issue, instead of learning from a combination of the teacher and the ground truth, we apply knowledge adjustment to correct the teacher's supervision using the ground truth. For the second problem, we train the student model within the selective classification framework. In particular, the deep gambler loss is adopted to predict with reservation by explicitly introducing the (m + 1)-th class. We consider two settings of knowledge distillation to evaluate the effectiveness of our method: (1) distillation across different network structures (AlexNet, ResNet), and (2) distillation across networks of different depths (ResNet18, ResNet50). Experimental results on benchmark datasets (i.e., Fashion-MNIST, SVHN, CIFAR10, CIFAR100) show higher prediction accuracies and lower coverage errors.
AB - Knowledge distillation, which aims to train a student model under the supervision of a larger teacher model, has achieved remarkable results in supervised learning. However, existing knowledge distillation methods suffer from two major problems: the teacher's supervision is sometimes misleading, and the student's predictions are not accurate enough. To address the first issue, instead of learning from a combination of the teacher and the ground truth, we apply knowledge adjustment to correct the teacher's supervision using the ground truth. For the second problem, we train the student model within the selective classification framework. In particular, the deep gambler loss is adopted to predict with reservation by explicitly introducing the (m + 1)-th class. We consider two settings of knowledge distillation to evaluate the effectiveness of our method: (1) distillation across different network structures (AlexNet, ResNet), and (2) distillation across networks of different depths (ResNet18, ResNet50). Experimental results on benchmark datasets (i.e., Fashion-MNIST, SVHN, CIFAR10, CIFAR100) show higher prediction accuracies and lower coverage errors.
UR - https://www.scopus.com/pages/publications/85110429597
U2 - 10.1109/ICPR48806.2021.9412696
DO - 10.1109/ICPR48806.2021.9412696
M3 - Conference contribution
AN - SCOPUS:85110429597
T3 - Proceedings - International Conference on Pattern Recognition
SP - 9000
EP - 9006
BT - Proceedings of ICPR 2020 - 25th International Conference on Pattern Recognition
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th International Conference on Pattern Recognition, ICPR 2020
Y2 - 10 January 2021 through 15 January 2021
ER -