TY - JOUR
T1 - Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning
AU - Zhu, Dandan
AU - Shao, Xuan
AU - Zhang, Kaiwei
AU - Min, Xiongkuo
AU - Zhai, Guangtao
AU - Yang, Xiaokang
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/10
Y1 - 2023/10
N2 - As audio information has been increasingly explored and leveraged in omnidirectional videos (ODVs), the performance of audio-visual saliency models has improved dramatically. However, these models are still in their infancy, and two significant issues remain in modeling human attention across the visual and auditory modalities: (1) the temporal non-alignment between the auditory and visual modalities is rarely considered; (2) most audio-visual saliency models are agnostic to audio content attributes and thus fail to learn fine-grained audio features. This paper proposes a novel audio-visual aligned saliency (AVAS) model that tackles both issues simultaneously in an effective end-to-end training manner. To solve the temporal non-alignment problem between the two modalities, a Hanning window is applied to the audio stream to truncate the audio signal per unit time (frame-time interval) so that it matches the visual stream of the corresponding duration, which captures the potential correlation between the two modalities across time steps and facilitates audio-visual feature fusion. To address the audio content attribute-agnostic issue, an effective periodic audio encoding method based on implicit neural representation (INR) is proposed to map audio sampling points to their corresponding audio frequency values, which better discriminates and interprets audio content attributes. Comprehensive experiments and detailed ablation analyses on the benchmark dataset demonstrate the efficacy of the proposed model. The experimental results indicate that the proposed model consistently outperforms competing methods by a large margin.
AB - As audio information has been increasingly explored and leveraged in omnidirectional videos (ODVs), the performance of audio-visual saliency models has improved dramatically. However, these models are still in their infancy, and two significant issues remain in modeling human attention across the visual and auditory modalities: (1) the temporal non-alignment between the auditory and visual modalities is rarely considered; (2) most audio-visual saliency models are agnostic to audio content attributes and thus fail to learn fine-grained audio features. This paper proposes a novel audio-visual aligned saliency (AVAS) model that tackles both issues simultaneously in an effective end-to-end training manner. To solve the temporal non-alignment problem between the two modalities, a Hanning window is applied to the audio stream to truncate the audio signal per unit time (frame-time interval) so that it matches the visual stream of the corresponding duration, which captures the potential correlation between the two modalities across time steps and facilitates audio-visual feature fusion. To address the audio content attribute-agnostic issue, an effective periodic audio encoding method based on implicit neural representation (INR) is proposed to map audio sampling points to their corresponding audio frequency values, which better discriminates and interprets audio content attributes. Comprehensive experiments and detailed ablation analyses on the benchmark dataset demonstrate the efficacy of the proposed model. The experimental results indicate that the proposed model consistently outperforms competing methods by a large margin.
KW - Audio-visual saliency
KW - Implicit neural representation
KW - Omnidirectional videos
KW - Spatial sound source localization
KW - Temporal alignment
UR - https://www.scopus.com/pages/publications/85163739721
U2 - 10.1007/s10489-023-04714-1
DO - 10.1007/s10489-023-04714-1
M3 - Article
AN - SCOPUS:85163739721
SN - 0924-669X
VL - 53
SP - 22615
EP - 22634
JO - Applied Intelligence
JF - Applied Intelligence
IS - 19
ER -