TY - GEN
T1 - Cross-modal knowledge distillation for fine-grained one-shot classification
AU - Zhao, Jiabao
AU - Lin, Xin
AU - Yang, Yifan
AU - Yang, Jing
AU - He, Liang
N1 - Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Few-shot learning can recognize a novel category from only a few samples because it learns to learn from many labeled samples during training. However, its performance degrades when data is insufficient, and obtaining a large-scale annotated fine-grained dataset is expensive. In this paper, we adopt domain-specific knowledge to fill the gap left by insufficient annotated data. We propose a cross-modal knowledge distillation (CMKD) framework for fine-grained one-shot classification and introduce the Spatial Relation Loss (SRL) to transfer cross-modal information, which bridges the semantic gap between multimodal features. The teacher network distills the spatial relationships of the samples as a soft target for training a unimodal student network. Notably, at inference time the student network makes predictions based only on a few samples, without any external knowledge. This model-agnostic framework adapts well to other few-shot models. Extensive experimental results on benchmarks demonstrate that CMKD makes full use of cross-modal knowledge in image and text few-shot classification. CMKD significantly improves the performance of student networks, even when the student is a state-of-the-art network.
AB - Few-shot learning can recognize a novel category from only a few samples because it learns to learn from many labeled samples during training. However, its performance degrades when data is insufficient, and obtaining a large-scale annotated fine-grained dataset is expensive. In this paper, we adopt domain-specific knowledge to fill the gap left by insufficient annotated data. We propose a cross-modal knowledge distillation (CMKD) framework for fine-grained one-shot classification and introduce the Spatial Relation Loss (SRL) to transfer cross-modal information, which bridges the semantic gap between multimodal features. The teacher network distills the spatial relationships of the samples as a soft target for training a unimodal student network. Notably, at inference time the student network makes predictions based only on a few samples, without any external knowledge. This model-agnostic framework adapts well to other few-shot models. Extensive experimental results on benchmarks demonstrate that CMKD makes full use of cross-modal knowledge in image and text few-shot classification. CMKD significantly improves the performance of student networks, even when the student is a state-of-the-art network.
KW - Cross-Modal
KW - Few-shot Learning
KW - Knowledge Distillation
UR - https://www.scopus.com/pages/publications/85115129422
U2 - 10.1109/ICASSP39728.2021.9414480
DO - 10.1109/ICASSP39728.2021.9414480
M3 - Conference contribution
AN - SCOPUS:85115129422
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 4295
EP - 4299
BT - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -