TY - GEN
T1 - TDiffSal
T2 - 27th International Conference on Pattern Recognition, ICPR 2024
AU - Zhang, Nana
AU - Xiong, Min
AU - Zhu, Dandan
AU - Zhu, Kun
AU - Zhai, Guangtao
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - Existing visual saliency prediction methods mainly focus on single-modal visual saliency prediction, while ignoring the significant impact of text on visual saliency. To more comprehensively explore the influence of text on human attention in images, we propose a text-guided diffusion saliency prediction model, named TDiffSal. Specifically, recent studies on stable diffusion models have shown promising performance in unifying tasks due to their inherent generalization ability. Inspired by this, a novel diffusion model for generalized visual-text saliency prediction is proposed, which formulates the prediction problem as a conditional generative task of the saliency map by employing the input image and text as conditions. Meanwhile, we introduce a multi-head fusion module to effectively integrate text features and image features, which efficiently guides the image denoising process and progressively refines the generated saliency map to make it semantically relevant to the text. Additionally, we employ an efficient pre-training strategy to enhance the robustness and generalization of the proposed model. We conduct extensive experiments on benchmark datasets to demonstrate its superior performance compared to other state-of-the-art methods.
AB - Existing visual saliency prediction methods mainly focus on single-modal visual saliency prediction, while ignoring the significant impact of text on visual saliency. To more comprehensively explore the influence of text on human attention in images, we propose a text-guided diffusion saliency prediction model, named TDiffSal. Specifically, recent studies on stable diffusion models have shown promising performance in unifying tasks due to their inherent generalization ability. Inspired by this, a novel diffusion model for generalized visual-text saliency prediction is proposed, which formulates the prediction problem as a conditional generative task of the saliency map by employing the input image and text as conditions. Meanwhile, we introduce a multi-head fusion module to effectively integrate text features and image features, which efficiently guides the image denoising process and progressively refines the generated saliency map to make it semantically relevant to the text. Additionally, we employ an efficient pre-training strategy to enhance the robustness and generalization of the proposed model. We conduct extensive experiments on benchmark datasets to demonstrate its superior performance compared to other state-of-the-art methods.
KW - Stable diffusion
KW - feature fusion
KW - multimodal
KW - saliency prediction
KW - text-guided visual saliency
UR - https://www.scopus.com/pages/publications/85211945106
U2 - 10.1007/978-3-031-78186-5_2
DO - 10.1007/978-3-031-78186-5_2
M3 - Conference contribution
AN - SCOPUS:85211945106
SN - 9783031781858
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 15
EP - 31
BT - Pattern Recognition - 27th International Conference, ICPR 2024, Proceedings
A2 - Antonacopoulos, Apostolos
A2 - Chaudhuri, Subhasis
A2 - Chellappa, Rama
A2 - Liu, Cheng-Lin
A2 - Bhattacharya, Saumik
A2 - Pal, Umapada
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 1 December 2024 through 5 December 2024
ER -