
MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model with Multi-Modal Transformer

  • Dandan Zhu
  • Kun Zhu*
  • Weiping Ding
  • Nana Zhang*
  • Xiongkuo Min
  • Guangtao Zhai
  • Xiaokang Yang
  • *Corresponding author for this work
  • Ministry of Education of the People's Republic of China
  • Tongji University
  • Nantong University
  • Donghua University
  • Shanghai Jiao Tong University

Research output: Contribution to journal, article, peer-reviewed

Abstract

Although various video saliency models have achieved considerable performance gains, existing deep learning-based audio-visual saliency prediction models are still at an early exploratory stage. The major challenge is that relatively few audio-visual sequences come with real human eye fixations collected under audio-visual conditions. To this end, this paper presents a novel multi-modal transformer-based class activation mapping (MTCAM) model, trained in a weakly-supervised manner, to alleviate the need for large-scale datasets in audio-visual saliency prediction. In particular, using only video category labels from the video classification task, we employ class activation mapping based on a multi-modal transformer, which follows a two-stage training methodology to extract the most discriminative regions. Such regions are highly consistent with real human eye fixations. Meanwhile, we devise an efficient feature reuse mechanism that reduces redundant computation and enables previously obtained features to provide effective guidance for downstream model learning. Notably, this work is the first attempt to exploit a cross-modal transformer to model cross-modal interaction across the entire video and to predict human eye fixations with a weakly-supervised training strategy. Extensive experiments on several benchmark datasets demonstrate that the proposed MTCAM model significantly outperforms other competitors. Detailed ablation experiments further validate the effectiveness and rationality of each component of the proposed model.
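
For readers unfamiliar with class activation mapping, the following is a minimal sketch of the generic CAM computation that the abstract builds on. It is not the MTCAM model: it has no transformer, no audio branch, and no two-stage training. The function name `class_activation_map`, the feature layout (channels, rows, columns) and the toy values are all assumptions made for illustration. The map for a class is the classifier-weighted sum of the final feature maps, clipped at zero and scaled so that its peak is 1.

```python
def class_activation_map(features, class_weights, class_idx):
    """Generic CAM sketch (illustrative only, not the paper's MTCAM).

    features:      list of C feature maps, each an H x W list of floats
    class_weights: list of per-class weight vectors, each of length C
    class_idx:     index of the class whose evidence map is wanted
    """
    weights = class_weights[class_idx]
    height, width = len(features[0]), len(features[0][0])
    # Weighted sum over channels at every spatial location, clipped at zero.
    cam = [
        [
            max(0.0, sum(w * fmap[y][x] for w, fmap in zip(weights, features)))
            for x in range(width)
        ]
        for y in range(height)
    ]
    # Scale so that the strongest location has value 1 (skip if all zero).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: two 2x2 feature maps and two classes (hypothetical values).
features = [
    [[2.0, 0.0], [0.0, 0.0]],  # channel 0 fires at the top-left
    [[0.0, 0.0], [0.0, 1.0]],  # channel 1 fires at the bottom-right
]
class_weights = [
    [1.0, 0.0],  # class 0 relies on channel 0
    [0.0, 1.0],  # class 1 relies on channel 1
]
cam_class0 = class_activation_map(features, class_weights, 0)
cam_class1 = class_activation_map(features, class_weights, 1)
```

In a weakly-supervised setting such as the one the abstract describes, a map of this kind is derived from a classifier trained only on category labels and is then read as a proxy for where attention falls; how MTCAM produces and refines its maps is specified in the paper itself, not here.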

Original language: English
Pages (from-to): 1756-1771
Number of pages: 16
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence
Volume: 8
Issue number: 2
DOI
Publication status: Published - 1 Apr 2024
