TY - JOUR
T1 - Audio-Visual Saliency Prediction Model with Implicit Neural Representation
AU - Zhang, Nana
AU - Xiong, Min
AU - Zhu, Dandan
AU - Zhu, Kun
AU - Zhai, Guangtao
AU - Yang, Xiaokang
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/4/7
Y1 - 2025/4/7
N2 - With the remarkable advancement of deep learning techniques and the wide availability of large-scale datasets, the performance of audio-visual saliency prediction has improved drastically. Nevertheless, audio-visual saliency prediction remains at an early exploration stage owing to the spatial-temporal signal complexity and dynamic continuity of video content. To our knowledge, most existing audio-visual saliency prediction approaches represent videos as a 3D grid of RGB values processed by discrete convolutional neural networks (CNNs), which inevitably makes them agnostic to video content and ignores the dynamic continuity of videos. This article proposes a novel parametric audio-visual saliency (PAVS) model with implicit neural representation (INR) to address these problems. Specifically, the proposed parametric neural network effectively encodes the space-time coordinates of video frames into corresponding saliency values, significantly enhancing the compactness of the feature representation. Meanwhile, a parametric feature fusion method is developed to achieve intrinsic interactions between the audio and visual information streams, adaptively fusing audio and visual features to obtain competitive performance. Notably, without resorting to any specific audio-visual feature fusion strategy, the proposed PAVS model outperforms other state-of-the-art saliency methods by a large margin.
AB - With the remarkable advancement of deep learning techniques and the wide availability of large-scale datasets, the performance of audio-visual saliency prediction has improved drastically. Nevertheless, audio-visual saliency prediction remains at an early exploration stage owing to the spatial-temporal signal complexity and dynamic continuity of video content. To our knowledge, most existing audio-visual saliency prediction approaches represent videos as a 3D grid of RGB values processed by discrete convolutional neural networks (CNNs), which inevitably makes them agnostic to video content and ignores the dynamic continuity of videos. This article proposes a novel parametric audio-visual saliency (PAVS) model with implicit neural representation (INR) to address these problems. Specifically, the proposed parametric neural network effectively encodes the space-time coordinates of video frames into corresponding saliency values, significantly enhancing the compactness of the feature representation. Meanwhile, a parametric feature fusion method is developed to achieve intrinsic interactions between the audio and visual information streams, adaptively fusing audio and visual features to obtain competitive performance. Notably, without resorting to any specific audio-visual feature fusion strategy, the proposed PAVS model outperforms other state-of-the-art saliency methods by a large margin.
KW - Implicit neural representation
KW - audio-visual saliency prediction
KW - generative model
KW - parameterized feature fusion method
UR - https://www.scopus.com/pages/publications/105003711093
U2 - 10.1145/3698881
DO - 10.1145/3698881
M3 - Article
AN - SCOPUS:105003711093
SN - 1551-6857
VL - 21
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 4
M1 - 117
ER -