TY - GEN
T1 - KD-VSUM
T2 - 2024 International Joint Conference on Neural Networks, IJCNN 2024
AU - Zheng, Zehong
AU - Li, Changlong
AU - Hu, Wenxin
AU - Wang, Su
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Multimodal abstractive summarization is attracting increasing attention because it can synthesize information from different source modalities and generate high-quality text summaries. Concurrently, multimodal abstractive summarization models for videos have advanced significantly; these models extract information from multimodal data and generate abstractive summaries. However, most existing approaches concentrate on instructional videos, such as those teaching sports or life skills, which limits their ability to capture the complexity of dynamic environments in the general world. In this paper, we propose KD-VSUM, a vision-guided model for multimodal abstractive summarization with knowledge distillation, to address the lack of generalized video-domain capability in video summarization. The approach includes a vision-guided encoder that enables the model to better focus on the global spatial and temporal information of video frames, and it leverages knowledge distillation from multimodal pre-trained video-language models to enhance performance. We also introduce the VersaVision dataset, which covers a broader range of video domains and a higher proportion of medium-to-long videos. The results demonstrate that our model surpasses existing state-of-the-art models on the VersaVision dataset by 1.7 ROUGE-1, 1.8 ROUGE-2, and 2.0 ROUGE-L points. These findings underscore the substantial improvements that integrating global vision guidance with knowledge distillation can bring to the task of video summarization.
AB - Multimodal abstractive summarization is attracting increasing attention because it can synthesize information from different source modalities and generate high-quality text summaries. Concurrently, multimodal abstractive summarization models for videos have advanced significantly; these models extract information from multimodal data and generate abstractive summaries. However, most existing approaches concentrate on instructional videos, such as those teaching sports or life skills, which limits their ability to capture the complexity of dynamic environments in the general world. In this paper, we propose KD-VSUM, a vision-guided model for multimodal abstractive summarization with knowledge distillation, to address the lack of generalized video-domain capability in video summarization. The approach includes a vision-guided encoder that enables the model to better focus on the global spatial and temporal information of video frames, and it leverages knowledge distillation from multimodal pre-trained video-language models to enhance performance. We also introduce the VersaVision dataset, which covers a broader range of video domains and a higher proportion of medium-to-long videos. The results demonstrate that our model surpasses existing state-of-the-art models on the VersaVision dataset by 1.7 ROUGE-1, 1.8 ROUGE-2, and 2.0 ROUGE-L points. These findings underscore the substantial improvements that integrating global vision guidance with knowledge distillation can bring to the task of video summarization.
KW - Abstractive Summarization
KW - Knowledge Distillation
KW - Multimodality
UR - https://www.scopus.com/pages/publications/85204947880
U2 - 10.1109/IJCNN60899.2024.10651189
DO - 10.1109/IJCNN60899.2024.10651189
M3 - Conference contribution
AN - SCOPUS:85204947880
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 30 June 2024 through 5 July 2024
ER -