TY - JOUR
T1 - FESNet: Spotting Facial Expressions Using Local Spatial Discrepancy and Multi-Scale Temporal Aggregation
T2 - Computing and Informatics
AU - Zhang, Bohao
AU - Lu, Jiale
AU - Wang, Changbo
AU - He, Gaoqi
N1 - Publisher Copyright:
© 2024 Slovak Academy of Sciences. All rights reserved.
PY - 2024
Y1 - 2024
AB - Facial expression (FE) spotting aims to split long videos into intervals of neutral expression, macro-expression, or micro-expression. Recent works mainly rely on feature descriptors or optical flow, and struggle to capture subtle facial motion and to aggregate temporal information efficiently. This paper proposes a novel end-to-end network, named FESNet (Facial Expression Spotting Network), to address these challenges. The main idea is to model subtle facial motion as a local spatial discrepancy and to incorporate temporal correlation through multi-scale temporal convolution. FESNet comprises a local spatial discrepancy module (LSDM) and a multi-scale temporal aggregation module (MTAM). The LSDM first extracts static spatial features from each frame by residual convolution and learns their inner spatial correlation by multi-head attention. The subtle facial motion of an expression is then modeled as the discrepancy between the first frame and the current frame of the input interval, yielding frame-wise spatial proposals. Taking the local spatial discrepancy features and proposals as input, the MTAM incorporates temporal correlation through multi-scale temporal convolution and performs cascade refinement to make the final prediction. Furthermore, this paper proposes a smooth loss to ensure the temporal consistency of the cascade-refined proposals from the MTAM. Comprehensive experiments show that FESNet achieves competitive performance compared to state-of-the-art methods.
KW - Facial expression analysis
KW - convolutional neural networks
KW - micro-expression spotting
KW - video understanding
UR - https://www.scopus.com/pages/publications/85195496822
DO - 10.31577/cai_2024_2_458
M3 - Article
AN - SCOPUS:85195496822
SN - 1335-9150
VL - 43
SP - 458
EP - 481
JO - Computing and Informatics
JF - Computing and Informatics
IS - 2
ER -