KD-VSUM: A Vision Guided Models for Multimodal Abstractive Summarization with Knowledge Distillation

Zehong Zheng, Changlong Li, Wenxin Hu, Su Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Multimodal abstractive summarization is attracting increasing attention because it can synthesize information from different source modalities and generate high-quality text summaries. Concurrently, multimodal abstractive summarization models for videos have developed rapidly; these models extract information from multimodal data and generate abstractive summaries. However, most existing approaches concentrate on instructional videos, such as those teaching sports or life skills, which limits their ability to capture the complexity of dynamic, real-world environments. In this paper, we propose a vision-guided model for multimodal abstractive summarization with knowledge distillation (KD-VSUM) to address the lack of general-domain capability in video summarization. The approach includes a vision-guided encoder that helps the model focus on the global spatial and temporal information of video frames, and it leverages knowledge distillation from multimodal pre-trained video-language models to enhance performance. We also introduce the VersaVision dataset, which covers a broader range of video domains and contains a higher proportion of medium-to-long videos. The results show that our model surpasses existing state-of-the-art models on the VersaVision dataset, with gains of 1.7 points in ROUGE-1, 1.8 in ROUGE-2, and 2.0 in ROUGE-L. These findings underscore the substantial improvements that integrating global vision guidance and knowledge distillation can bring to video summarization.
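
The abstract names two mechanisms: a vision-guided encoder that attends over global video-frame features, and knowledge distillation from a pretrained video-language teacher. The following is a minimal sketch of how such components are commonly implemented, not the authors' code; all module names, tensor shapes, and hyperparameters (VisionGuidedLayer, temperature T, weight alpha) are illustrative assumptions.

# Minimal sketch (assumed design, not the published KD-VSUM implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionGuidedLayer(nn.Module):
    """Text token states attend over global video-frame features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (B, T_text, d), frame_feats: (B, T_frames, d)
        attended, _ = self.cross_attn(text_states, frame_feats, frame_feats)
        return self.norm(text_states + attended)


def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-label KL to the teacher's distribution."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl


if __name__ == "__main__":
    B, T_text, T_frames, d, vocab = 2, 6, 10, 512, 1000
    layer = VisionGuidedLayer(d)
    fused = layer(torch.randn(B, T_text, d), torch.randn(B, T_frames, d))
    student = torch.randn(B, T_text, vocab)
    teacher = torch.randn(B, T_text, vocab)
    labels = torch.randint(0, vocab, (B, T_text))
    print(fused.shape, distillation_loss(student, teacher, labels).item())

In this kind of setup the teacher's soft token distributions carry information the hard labels do not, and the temperature T controls how much of that softer signal the student sees; the cross-attention layer is one common way to inject global frame context into a text decoder's hidden states.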

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359312
DOIs
State: Published - 2024
Event: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Yokohama, Japan
Duration: 30 Jun 2024 - 5 Jul 2024

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024
Country/Territory: Japan
City: Yokohama
Period: 30/06/24 - 05/07/24

Keywords

  • Abstractive Summarization
  • Knowledge Distillation
  • Multimodality
