TY - JOUR
T1 - MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model with Multi-Modal Transformer
AU - Zhu, Dandan
AU - Zhu, Kun
AU - Ding, Weiping
AU - Zhang, Nana
AU - Min, Xiongkuo
AU - Zhai, Guangtao
AU - Yang, Xiaokang
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2024/4/1
Y1 - 2024/4/1
N2 - Although various video saliency models have achieved considerable performance gains, existing deep learning-based audio-visual saliency prediction models are still in the early exploration stage. The major challenge is that relatively few audio-visual sequences with real human eye fixations have been collected under audio-visual conditions. To this end, this paper presents a novel multi-modal transformer-based class activation mapping (MTCAM) model trained in a weakly-supervised manner to effectively alleviate the need for large-scale datasets in audio-visual saliency prediction. In particular, using only video category labels from the video classification task, we propose class activation mapping based on a multi-modal transformer, which follows a two-stage training methodology to extract the most discriminative regions. Such regions with strong discriminative ability are highly consistent with real human eye fixations. Meanwhile, we further devise an efficient feature reuse mechanism to reduce redundant computation and enable previously obtained features to provide effective guidance for downstream model learning. It is particularly noteworthy that this work is the first attempt to exploit the cross-modal transformer to model cross-modal interaction across an entire video and predict human eye fixations with a weakly-supervised training strategy. We conduct extensive experiments on several benchmark datasets to demonstrate that the proposed MTCAM model significantly outperforms other competitors. Furthermore, detailed ablation experiments are also performed to validate the effectiveness and rationality of each component of our proposed model.
KW - Weakly-supervised training strategy
KW - audio-visual saliency prediction
KW - cross-modal transformer
KW - feature reuse mechanism
KW - two-stage training methodology
UR - https://www.scopus.com/pages/publications/85184799237
U2 - 10.1109/TETCI.2024.3358184
DO - 10.1109/TETCI.2024.3358184
M3 - Article
AN - SCOPUS:85184799237
SN - 2471-285X
VL - 8
SP - 1756
EP - 1771
JO - IEEE Transactions on Emerging Topics in Computational Intelligence
JF - IEEE Transactions on Emerging Topics in Computational Intelligence
IS - 2
ER -