STM-SalNet: A Biologically-Inspired Spatial-Temporal Memory Network for Video Saliency Prediction

  • Jikai Xu
  • Dandan Zhu*
  • Kaiwei Zhang
  • Xiongkuo Min

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, video saliency prediction has attracted significant attention across a wide range of vision-related tasks. However, most existing video saliency prediction methods predominantly rely on static encoder-decoder architectures, failing to incorporate the dynamic memory mechanisms that are fundamental to human visual perception and attention modeling. To address this limitation, we propose STM-SalNet, a novel biologically-inspired spatial-temporal memory network for video saliency prediction. First, inspired by the powerful visual processing capabilities of the human visual cortex, we introduce a brain-inspired Vision Transformer module designed to extract multi-level hierarchical spatial-temporal features. Subsequently, we propose a memory bank module equipped with an active forgetting mechanism, simulating human memory’s ability to selectively retain and update information. By dynamically retrieving relevant features from past frames while discarding redundancy, the module ensures robust adaptability to continuously evolving video content. To further enhance the integration of spatial and temporal features, we design a bidirectional spatial-temporal fusion module that facilitates effective interaction between deep semantic and shallow spatial features, enriching the overall feature representation. Finally, a progressively hierarchical decoder module is employed to generate fine-grained, pixel-wise saliency maps that closely align with ground truths. Extensive experiments on the DHF1K, Hollywood-2, and UCF-Sports benchmark datasets demonstrate that our proposed STM-SalNet achieves competitive performance compared to existing state-of-the-art methods.
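The abstract describes a memory bank with an active forgetting mechanism that retains relevant past-frame features and discards redundancy. The sketch below is a hypothetical, simplified illustration of that idea only — it is not the authors' implementation, and the class name, capacity, decay rate, and cosine-similarity retrieval are all assumptions made for illustration.

```python
import numpy as np

class MemoryBank:
    """Toy memory bank with relevance decay ("active forgetting").

    Illustrative sketch only: stores per-frame feature vectors, retrieves
    a similarity-weighted mixture, and evicts the least relevant slot
    when capacity is exceeded (rather than simply the oldest).
    """

    def __init__(self, capacity=8, decay=0.9):
        self.capacity = capacity   # max number of stored frame features
        self.decay = decay         # per-read relevance decay rate
        self.slots = []            # list of [feature, relevance] pairs

    def write(self, feature):
        # New memories enter at full relevance.
        self.slots.append([np.asarray(feature, dtype=float), 1.0])
        if len(self.slots) > self.capacity:
            # Actively forget the least relevant memory.
            self.slots.pop(min(range(len(self.slots)),
                               key=lambda i: self.slots[i][1]))

    def read(self, query):
        # Retrieve a relevance- and similarity-weighted mixture of memories.
        query = np.asarray(query, dtype=float)
        weights = []
        for feat, _rel in self.slots:
            sim = feat @ query / (np.linalg.norm(feat) *
                                  np.linalg.norm(query) + 1e-8)
            weights.append(_rel * max(sim, 0.0))
        total = sum(weights) + 1e-8
        out = sum(w * feat for w, (feat, _) in zip(weights, self.slots)) / total
        # Slots that contributed are reinforced; unused slots decay.
        for i, w in enumerate(weights):
            self.slots[i][1] = (self.slots[i][1] * self.decay
                                + (w / total) * (1.0 - self.decay))
        return out
```

Under these assumptions, reading with a query close to a stored feature returns a vector dominated by that feature, while memories that never contribute to retrieval gradually lose relevance and become eviction candidates.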

Original language: English
Title of host publication: Neural Information Processing - 32nd International Conference, ICONIP 2025, Proceedings
Editors: Tadahiro Taniguchi, Chi Sing Andrew Leung, Tadashi Kozuno, Junichiro Yoshimoto, Mufti Mahmud, Maryam Doborjeh, Kenji Doya
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 335-349
Number of pages: 15
ISBN (Print): 9789819540969
DOIs
State: Published - 2026
Event: 32nd International Conference on Neural Information Processing, ICONIP 2025 - Okinawa, Japan
Duration: 20 Nov 2025 – 24 Nov 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2756 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 32nd International Conference on Neural Information Processing, ICONIP 2025
Country/Territory: Japan
City: Okinawa
Period: 20/11/25 – 24/11/25

Keywords

  • Active Forgetting
  • Hippocampus
  • Memory Bank
  • Transformer
  • Video Saliency Prediction
