A Novel Lightweight Audio-visual Saliency Model for Videos

Dandan Zhu, Xuan Shao, Qiangqiang Zhou, Xiongkuo Min, Guangtao Zhai, Xiaokang Yang

Research output: Contribution to journal › Article › peer-review

Abstract

Audio information has rarely been treated as an important factor in visual attention models, despite many psychological studies showing the importance of audio cues in the human visual perception system. Because existing visual attention models rely on visual information alone, their performance is not only limited by the restricted information available but also burdened by high computational complexity. To overcome these problems, we propose a lightweight audio-visual saliency (LAVS) model for video sequences. To the best of our knowledge, this article is the first attempt to exploit audio cues in an efficient deep-learning model for video saliency estimation. First, spatial-temporal visual features are extracted by a lightweight receptive field block (RFB) combined with bidirectional ConvLSTM units. Then, audio features are extracted using an improved lightweight environment sound classification model. Subsequently, deep canonical correlation analysis (DCCA) captures the correspondence between the audio and spatial-temporal visual features, yielding a spatial-temporal auditory saliency map. Lastly, the spatial-temporal visual and auditory saliency maps are fused to obtain the final audio-visual saliency map. Extensive comparative experiments and ablation studies validate the performance of the LAVS model in terms of both effectiveness and complexity.
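The DCCA step summarized above aligns audio and visual features by maximizing their canonical correlation. As a minimal sketch (not the authors' implementation), the following PyTorch function computes the standard DCCA objective over pooled per-clip feature vectors; the function name dcca_loss, the regularization constant eps, and the assumption of (N, d)-shaped feature matrices are illustrative assumptions, not details taken from the paper.

    import torch

    def dcca_loss(h1: torch.Tensor, h2: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
        """Negative total canonical correlation between two feature views.

        h1: (N, d1) visual features, h2: (N, d2) audio features.
        Names and the regularizer eps are illustrative, not from the paper.
        """
        n = h1.size(0)
        # Center each view.
        h1 = h1 - h1.mean(dim=0, keepdim=True)
        h2 = h2 - h2.mean(dim=0, keepdim=True)

        # Cross- and auto-covariance matrices (with a small ridge for stability).
        sigma12 = (h1.t() @ h2) / (n - 1)
        sigma11 = (h1.t() @ h1) / (n - 1) + eps * torch.eye(h1.size(1), device=h1.device)
        sigma22 = (h2.t() @ h2) / (n - 1) + eps * torch.eye(h2.size(1), device=h2.device)

        def inv_sqrt(m: torch.Tensor) -> torch.Tensor:
            # Inverse square root of a symmetric positive-definite matrix.
            vals, vecs = torch.linalg.eigh(m)
            vals = vals.clamp_min(eps)
            return vecs @ torch.diag(vals.rsqrt()) @ vecs.t()

        # T = Sigma11^{-1/2} Sigma12 Sigma22^{-1/2}; its singular values are
        # the canonical correlations between the two views.
        t_mat = inv_sqrt(sigma11) @ sigma12 @ inv_sqrt(sigma22)
        corr = torch.linalg.svdvals(t_mat).sum()
        return -corr

During training, minimizing this loss jointly over the visual and audio feature extractors would push matching video and audio clips to project onto correlated subspaces, which is the role DCCA plays before the saliency maps are fused.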

Original language: English
Article number: 147
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 19
Issue number: 4
DOIs
State: Published - 1 Mar 2023

Keywords

  • Lightweight model
  • audio-visual saliency prediction
  • deep canonical correlation analysis
  • feature fusion
  • sound source localization
