FESNET: SPOTTING FACIAL EXPRESSIONS USING LOCAL SPATIAL DISCREPANCY AND MULTI-SCALE TEMPORAL AGGREGATION

Bohao Zhang, Jiale Lu, Changbo Wang, Gaoqi He

Research output: Contribution to journal / Article / peer-review

4 Scopus citations

Abstract

Facial expression (FE) spotting aims to split long videos into intervals of neutral expression, macro-expression, or micro-expression. Recent works mainly rely on feature descriptors or optical flow, and they struggle both to capture subtle facial motion and to aggregate temporal information efficiently. This paper proposes a novel end-to-end network, named FESNet (Facial Expression Spotting Network), to address these challenges. The main idea is to model subtle facial motion as local spatial discrepancy and to incorporate temporal correlation through multi-scale temporal convolution. FESNet comprises a local spatial discrepancy module (LSDM) and a multi-scale temporal aggregation module (MTAM). The LSDM first extracts static spatial features from each frame by residual convolution and learns the inner spatial correlation by multi-head attention. It then models the subtle facial motion of an expression as the discrepancy between the first frame and the current frame of the input interval, producing frame-wise spatial proposals. Taking the local spatial discrepancy features and proposals as input, the MTAM incorporates temporal correlation by multi-scale temporal convolution and performs cascade refinement to make the final prediction. Furthermore, this paper proposes a smooth loss to ensure the temporal consistency of the cascade-refined proposals from the MTAM. Comprehensive experiments show that FESNet achieves competitive performance compared to state-of-the-art methods.
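
The three ideas named in the abstract (discrepancy against the first frame, multi-scale temporal aggregation, and a temporal smoothness term) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names, the kernel sizes (3, 5, 7), the use of a fixed moving average in place of learned temporal convolutions, and the mean-squared adjacent-difference penalty are all assumptions made for illustration. Residual convolution, multi-head attention, and cascade refinement are omitted.

```python
from typing import List, Sequence

Vector = List[float]


def local_discrepancy(features: Sequence[Vector]) -> List[Vector]:
    """Subtract the first frame's features from every frame in the interval."""
    first = features[0]
    return [[x - ref for x, ref in zip(frame, first)] for frame in features]


def temporal_average(signal: Sequence[float], kernel_size: int) -> List[float]:
    """Same-length moving average with edge clamping.

    Assumption: a fixed averaging kernel stands in for a learned 1D
    temporal convolution. kernel_size is expected to be odd.
    """
    half = kernel_size // 2
    n = len(signal)
    out = []
    for t in range(n):
        # Clamp indices so the window stays inside the sequence.
        window = [signal[min(max(t + k, 0), n - 1)] for k in range(-half, half + 1)]
        out.append(sum(window) / len(window))
    return out


def multi_scale_aggregate(
    signal: Sequence[float], kernel_sizes: Sequence[int] = (3, 5, 7)
) -> List[float]:
    """Average the outputs of several temporal windows (hypothetical scales)."""
    per_scale = [temporal_average(signal, k) for k in kernel_sizes]
    return [sum(values) / len(values) for values in zip(*per_scale)]


def smoothness_penalty(scores: Sequence[float]) -> float:
    """Mean squared difference between adjacent frame scores.

    Assumption: a generic smoothness term, not the paper's smooth loss.
    """
    if len(scores) < 2:
        return 0.0
    diffs = [(b - a) ** 2 for a, b in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


# Toy interval: five frames with two-dimensional features (made-up values).
features = [[0.0, 1.0], [0.0, 1.0], [0.5, 1.0], [1.0, 1.0], [0.0, 1.0]]
discrepancy = local_discrepancy(features)
# Collapse each frame's discrepancy vector to a scalar motion magnitude.
motion = [sum(abs(v) for v in frame) for frame in discrepancy]
aggregated = multi_scale_aggregate(motion)
penalty = smoothness_penalty(aggregated)
```

In this toy setting the first frame always has zero discrepancy, so the per-frame motion signal measures change relative to the start of the interval rather than to the previous frame. Averaging over several window sizes lets one signal respond to both short and long events, which is the intuition behind using multiple temporal scales for micro- and macro-expressions.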

Original language: English
Pages (from-to): 458-481
Number of pages: 24
Journal: Computing and Informatics
Volume: 43
Issue number: 2
State: Published - 2024

Keywords

  • Facial expression analysis
  • convolutional neural networks
  • micro-expression spotting
  • video understanding
