A Lightweight and Effective Multi-View Knowledge Distillation Framework for Text-Image Retrieval

  • Yuxiang Song
  • Yuxuan Zheng
  • Shangqing Zhao
  • Shu Liu
  • Xinlin Zhuang
  • Zhaoguang Long
  • Changzhi Sun
  • Aimin Zhou
  • Man Lan*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Large-scale dual-stream Vision-Language Pre-training (VLP) models provide an efficient solution for text-image retrieval tasks. Despite this, their performance often falls short of the most recent single-stream models, primarily due to limited fine-grained text-image interactions. Recent trends indicate a union of these two types of networks. Some methods adopt a retrieve-and-rerank strategy, but their performance improvements largely hinge on the single-stream encoder during inference. Other approaches utilize knowledge distillation to strengthen either the single-stream encoder or the dual-stream encoder, surpassing their previous capabilities. However, existing distillation techniques typically focus on a single knowledge type, neglecting the richer insights available in the teacher model. To bridge this gap, we introduce a Lightweight and Effective Multi-View Knowledge Distillation approach, named LEMKD, for text-image retrieval. This method effectively utilizes response-based, feature-based, and relation-based knowledge, transferring the knowledge of the single-stream encoder to the dual-stream encoder. We evaluate our approach on the widely used MS-COCO and Flickr30K datasets. Results demonstrate that LEMKD not only matches the exceptional performance of the most advanced single-stream models but also achieves the strongest dual-stream encoder performance among recent methods that integrate single-stream and dual-stream models.
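The abstract names three knowledge views (response-based, feature-based, relation-based) combined into one distillation objective. As an illustrative sketch only, not LEMKD's actual formulation, the three views are commonly instantiated as a softened-logit KL term, an embedding MSE term, and a pairwise-similarity matching term; all function names, loss forms, and weights below are assumptions:

```python
import numpy as np

def softmax(x, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def response_kd_loss(student_logits, teacher_logits, T=4.0):
    # Response-based view: KL(teacher || student) over softened
    # text-image similarity scores, scaled by T^2 as in standard KD.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)

def feature_kd_loss(student_feats, teacher_feats):
    # Feature-based view: MSE between (projected) intermediate embeddings.
    return float(((student_feats - teacher_feats) ** 2).mean())

def relation_kd_loss(student_feats, teacher_feats):
    # Relation-based view: match the within-batch cosine-similarity
    # structure of student and teacher embeddings.
    sn = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    tn = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float(((sn @ sn.T - tn @ tn.T) ** 2).mean())

def multi_view_kd_loss(student_logits, teacher_logits,
                       student_feats, teacher_feats,
                       alpha=1.0, beta=1.0, gamma=1.0):
    # Weighted sum of the three views; alpha/beta/gamma are hypothetical.
    return (alpha * response_kd_loss(student_logits, teacher_logits)
            + beta * feature_kd_loss(student_feats, teacher_feats)
            + gamma * relation_kd_loss(student_feats, teacher_feats))
```

Each term is zero when the student exactly matches the teacher, so the combined loss drives the dual-stream student toward the single-stream teacher along all three views at once.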

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359312
DOIs
State: Published - 2024
Event: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Yokohama, Japan
Duration: 30 Jun 2024 - 5 Jul 2024

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024
Country/Territory: Japan
City: Yokohama
Period: 30/06/24 - 5/07/24

Keywords

  • knowledge distillation
  • multi-modal
  • text-image retrieval

