TY - JOUR
T1 - Explain Vision Focus
T2 - Blending Human Saliency Into Synthetic Face Images
AU - Zhang, Kaiwei
AU - Zhu, Dandan
AU - Min, Xiongkuo
AU - Duan, Huiyu
AU - Zhai, Guangtao
N1 - Publisher Copyright:
© 2024 IEEE. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Synthetic faces have been extensively researched and applied in various fields, such as face parsing and recognition. Compared to real face images, synthetic faces engender more controllable and consistent experimental stimuli due to the ability to precisely merge expression animations onto the facial skeleton. Accordingly, we establish an eye-tracking database with 780 synthetic face images and fixation data collected from 22 participants. The use of synthetic images with consistent expressions ensures reliable data support for exploring the database and determining the following findings: (1) A correlation study between saliency intensity and facial movement reveals that the variation of attention distribution within facial regions is mainly attributed to the movement of the mouth. (2) A categorized analysis of different demographic factors demonstrates that the bias towards salient regions aligns with differences in some demographic categories of synthetic characters. In practice, inference of facial saliency distribution is commonly used to predict the regions of interest for facial video-related applications. Therefore, we propose a benchmark model that accurately predicts saliency maps, closely matching the ground truth annotations. This achievement is made possible by utilizing channel alignment and progressive summation for feature fusion, along with the incorporation of Sinusoidal Position Encoding. The ablation experiment also demonstrates the effectiveness of our proposed model. We hope that this paper will contribute to advancing the photorealism of generative digital humans.
AB - Synthetic faces have been extensively researched and applied in various fields, such as face parsing and recognition. Compared to real face images, synthetic faces engender more controllable and consistent experimental stimuli due to the ability to precisely merge expression animations onto the facial skeleton. Accordingly, we establish an eye-tracking database with 780 synthetic face images and fixation data collected from 22 participants. The use of synthetic images with consistent expressions ensures reliable data support for exploring the database and determining the following findings: (1) A correlation study between saliency intensity and facial movement reveals that the variation of attention distribution within facial regions is mainly attributed to the movement of the mouth. (2) A categorized analysis of different demographic factors demonstrates that the bias towards salient regions aligns with differences in some demographic categories of synthetic characters. In practice, inference of facial saliency distribution is commonly used to predict the regions of interest for facial video-related applications. Therefore, we propose a benchmark model that accurately predicts saliency maps, closely matching the ground truth annotations. This achievement is made possible by utilizing channel alignment and progressive summation for feature fusion, along with the incorporation of Sinusoidal Position Encoding. The ablation experiment also demonstrates the effectiveness of our proposed model. We hope that this paper will contribute to advancing the photorealism of generative digital humans.
KW - Eye tracking
KW - saliency prediction
KW - synthetic face image
KW - visual attention
UR - https://www.scopus.com/pages/publications/85213833380
U2 - 10.1109/TMM.2024.3521670
DO - 10.1109/TMM.2024.3521670
M3 - 文章
AN - SCOPUS:85213833380
SN - 1520-9210
VL - 27
SP - 489
EP - 502
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -