TY - JOUR
T1 - From Discrete Representation to Continuous Modeling
T2 - A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations
AU - Zhu, Dandan
AU - Zhang, Kaiwei
AU - Zhu, Kun
AU - Zhang, Nana
AU - Ding, Weiping
AU - Zhai, Guangtao
AU - Yang, Xiaokang
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2024
Y1 - 2024
N2 - In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, leading to issues such as content-agnostic modeling and sub-optimal feature representation ability. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of our proposed DAVS model is to build an effective mapping by exploiting a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.
AB - In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, leading to issues such as content-agnostic modeling and sub-optimal feature representation ability. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of our proposed DAVS model is to build an effective mapping by exploiting a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, our model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors, learning video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, enabling intrinsic interactions between modalities and adaptively integrating visual and audio cues. Through extensive experiments on benchmark datasets, our proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.
KW - Audio-visual saliency prediction
KW - feature-adaptive
KW - implicit neural representation
KW - parametric feature fusion strategy
UR - https://www.scopus.com/pages/publications/85190743496
U2 - 10.1109/TETCI.2024.3386619
DO - 10.1109/TETCI.2024.3386619
M3 - Article
AN - SCOPUS:85190743496
SN - 2471-285X
VL - 8
SP - 4059
EP - 4074
JO - IEEE Transactions on Emerging Topics in Computational Intelligence
JF - IEEE Transactions on Emerging Topics in Computational Intelligence
IS - 6
ER -