Abstract
Although video saliency models have achieved considerable performance gains, deep learning-based audio-visual saliency prediction models are still at an early stage of exploration. The major challenge is that relatively few audio-visual sequences come with real human eye fixations collected under audio-visual conditions. To this end, this paper presents a novel multi-modal transformer-based class activation mapping (MTCAM) model, trained in a weakly-supervised manner, to alleviate the need for large-scale datasets for audio-visual saliency prediction. In particular, using only video category labels from the video classification task, we employ class activation mapping based on a multi-modal transformer, which follows a two-stage training methodology to extract the most discriminative regions. Such strongly discriminative regions are highly consistent with real human eye fixations. Meanwhile, we devise an efficient feature reuse mechanism that reduces redundant computation and allows previously obtained features to provide effective guidance for downstream model learning. Notably, this work is the first attempt to exploit a cross-modal transformer to model cross-modal interaction across the entire video and to predict human eye fixations with a weakly-supervised training strategy. Extensive experiments on several benchmark datasets demonstrate that the proposed MTCAM model significantly outperforms other competitors. Furthermore, detailed ablation experiments validate the effectiveness and rationality of each component of the proposed model.
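For readers unfamiliar with class activation mapping, the following is a minimal sketch of the classic, single-modality formulation only: a class-specific map is the weighted sum of the final feature channels, using the classifier weights for that class. It is not the MTCAM architecture described in the abstract; the function name, the nested-list feature layout, and the toy values are all assumptions made for illustration.

```python
def class_activation_map(features, class_weights):
    """Classic CAM sketch (hypothetical helper, not the MTCAM model).

    features:      K channels, each an H x W nested list of floats.
    class_weights: K classifier weights for one target class.
    Returns an H x W map min-max normalized to [0, 1].
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]

    # Weighted sum over channels: cam[y][x] = sum_k w_k * f_k[y][x]
    for channel, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]

    # Min-max normalization so the map can be read as a saliency map.
    lowest = min(min(row) for row in cam)
    highest = max(max(row) for row in cam)
    span = highest - lowest
    if span == 0:
        # A flat map carries no localization information.
        return [[0.0] * width for _ in range(height)]
    return [[(value - lowest) / span for value in row] for row in cam]


# Toy example: two 2 x 2 feature channels and assumed class weights.
toy_features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
toy_weights = [2.0, 0.5]
saliency = class_activation_map(toy_features, toy_weights)
print(saliency)  # [[1.0, 0.0], [0.0, 0.5]]
```

In this toy case the top-left location receives the highest activation because the first channel carries the larger class weight. In a weakly-supervised setting, only the class label is needed to train the classifier that supplies those weights.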
| Original language | English |
|---|---|
| Pages (from-to) | 1756-1771 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Emerging Topics in Computational Intelligence |
| Volume | 8 |
| Issue number | 2 |
| DOIs | |
| State | Published - 1 Apr 2024 |
Keywords
- weakly-supervised training strategy
- audio-visual saliency prediction
- cross-modal transformer
- feature reuse mechanism
- two-stage training methodology
Title
MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model with Multi-Modal Transformer