TY - JOUR
T1 - ScanDTM: A Novel Dual-Temporal Modulation Scanpath Prediction Model for Omnidirectional Images
T2 - IEEE Transactions on Circuits and Systems for Video Technology
AU - Zhu, Dandan
AU - Zhang, Kaiwei
AU - Min, Xiongkuo
AU - Zhai, Guangtao
AU - Yang, Xiaokang
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Scanpath prediction for omnidirectional images aims to effectively simulate the human visual perception mechanism to generate dynamic, realistic fixation trajectories. However, the majority of scanpath prediction methods for omnidirectional images are still in their infancy, as they fail to accurately capture the time-dependency of viewing behavior and suffer from sub-optimal performance along with limited generalization capability. A desirable solution should achieve a better trade-off between prediction performance and generalization ability. To this end, we propose a novel dual-temporal modulation scanpath prediction (ScanDTM) model for omnidirectional images. The model is designed to effectively capture long-range time-dependencies between various fixation regions across both internal and external time dimensions, thereby generating more realistic scanpaths. In particular, we design a Dual Graph Convolutional Network (Dual-GCN) module comprising a semantic-level GCN and an image-level GCN. This module serves as a robust visual encoder that captures spatial relationships among various object regions within an image and fully utilizes similar images as complementary information to capture similarity relations across relevant images. Notably, the proposed Dual-GCN focuses on modeling temporal correlations from both local and global perspectives within the internal time dimension. Furthermore, drawing inspiration from the promising generalization capabilities of diffusion models across various generative tasks, we introduce a novel diffusion-guided saliency module. This module formulates the prediction task as a conditional generative process for the saliency map, utilizing the extracted semantic-level and image-level visual features as conditions. With the well-designed diffusion-guided saliency module acting as an external temporal modulator, our ScanDTM model progressively refines the generated scanpath from the noisy map. We conduct extensive experiments on several benchmark datasets, and the results demonstrate that our ScanDTM model significantly outperforms other competitors. Meanwhile, when applied to tasks such as saliency prediction and image quality assessment, our ScanDTM model consistently achieves superior generalization performance.
AB - Scanpath prediction for omnidirectional images aims to effectively simulate the human visual perception mechanism to generate dynamic, realistic fixation trajectories. However, the majority of scanpath prediction methods for omnidirectional images are still in their infancy, as they fail to accurately capture the time-dependency of viewing behavior and suffer from sub-optimal performance along with limited generalization capability. A desirable solution should achieve a better trade-off between prediction performance and generalization ability. To this end, we propose a novel dual-temporal modulation scanpath prediction (ScanDTM) model for omnidirectional images. The model is designed to effectively capture long-range time-dependencies between various fixation regions across both internal and external time dimensions, thereby generating more realistic scanpaths. In particular, we design a Dual Graph Convolutional Network (Dual-GCN) module comprising a semantic-level GCN and an image-level GCN. This module serves as a robust visual encoder that captures spatial relationships among various object regions within an image and fully utilizes similar images as complementary information to capture similarity relations across relevant images. Notably, the proposed Dual-GCN focuses on modeling temporal correlations from both local and global perspectives within the internal time dimension. Furthermore, drawing inspiration from the promising generalization capabilities of diffusion models across various generative tasks, we introduce a novel diffusion-guided saliency module. This module formulates the prediction task as a conditional generative process for the saliency map, utilizing the extracted semantic-level and image-level visual features as conditions. With the well-designed diffusion-guided saliency module acting as an external temporal modulator, our ScanDTM model progressively refines the generated scanpath from the noisy map. We conduct extensive experiments on several benchmark datasets, and the results demonstrate that our ScanDTM model significantly outperforms other competitors. Meanwhile, when applied to tasks such as saliency prediction and image quality assessment, our ScanDTM model consistently achieves superior generalization performance.
KW - Scanpath prediction
KW - diffusion model
KW - dual graph convolutional network
KW - dual temporal modulator
KW - omnidirectional image
UR - https://www.scopus.com/pages/publications/86000451736
U2 - 10.1109/TCSVT.2025.3545908
DO - 10.1109/TCSVT.2025.3545908
M3 - Article
AN - SCOPUS:86000451736
SN - 1051-8215
VL - 35
SP - 7850
EP - 7865
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 8
ER -