TY - GEN
T1 - A Lightweight and Effective Multi-View Knowledge Distillation Framework for Text-Image Retrieval
AU - Song, Yuxiang
AU - Zheng, Yuxuan
AU - Zhao, Shangqing
AU - Liu, Shu
AU - Zhuang, Xinlin
AU - Long, Zhaoguang
AU - Sun, Changzhi
AU - Zhou, Aimin
AU - Lan, Man
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Large-scale dual-stream Vision-Language Pre-training (VLP) models provide an efficient solution for text-image retrieval tasks. Despite this, their performance often falls short of the most recent single-stream models, primarily due to limited fine-grained text-image interactions. Recent trends point toward a union of these two types of networks. Some methods adopt a retrieve-and-rerank strategy, but their performance improvements largely hinge on the single-stream encoder during inference. Other approaches utilize knowledge distillation to strengthen either the single-stream encoder or the dual-stream encoder, surpassing their previous capabilities. However, existing distillation techniques typically focus on a single knowledge type, neglecting the richer insights available in the teacher model. To bridge this gap, we introduce a Lightweight and Effective Multi-View Knowledge Distillation approach, named LEMKD, for text-image retrieval. This method effectively utilizes response-based, feature-based, and relation-based knowledge, transferring knowledge from the single-stream encoder to the dual-stream encoder. Our approach is evaluated on the widely used MS-COCO and Flickr30K datasets. Results demonstrate that LEMKD not only matches the exceptional performance of the most advanced single-stream models but also achieves leading dual-stream encoder performance amid the recent integration of single-stream and dual-stream models.
AB - Large-scale dual-stream Vision-Language Pre-training (VLP) models provide an efficient solution for text-image retrieval tasks. Despite this, their performance often falls short of the most recent single-stream models, primarily due to limited fine-grained text-image interactions. Recent trends point toward a union of these two types of networks. Some methods adopt a retrieve-and-rerank strategy, but their performance improvements largely hinge on the single-stream encoder during inference. Other approaches utilize knowledge distillation to strengthen either the single-stream encoder or the dual-stream encoder, surpassing their previous capabilities. However, existing distillation techniques typically focus on a single knowledge type, neglecting the richer insights available in the teacher model. To bridge this gap, we introduce a Lightweight and Effective Multi-View Knowledge Distillation approach, named LEMKD, for text-image retrieval. This method effectively utilizes response-based, feature-based, and relation-based knowledge, transferring knowledge from the single-stream encoder to the dual-stream encoder. Our approach is evaluated on the widely used MS-COCO and Flickr30K datasets. Results demonstrate that LEMKD not only matches the exceptional performance of the most advanced single-stream models but also achieves leading dual-stream encoder performance amid the recent integration of single-stream and dual-stream models.
KW - knowledge distillation
KW - multi-modal
KW - text-image retrieval
UR - https://www.scopus.com/pages/publications/85204943260
U2 - 10.1109/IJCNN60899.2024.10650723
DO - 10.1109/IJCNN60899.2024.10650723
M3 - Conference contribution
AN - SCOPUS:85204943260
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Joint Conference on Neural Networks, IJCNN 2024
Y2 - 30 June 2024 through 5 July 2024
ER -