TY - GEN
T1 - Dual focus attention network for video emotion recognition
AU - Qiu, Haonan
AU - He, Liang
AU - Wang, Feng
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
N2 - Video emotion recognition is a challenging task due to complex scenes and various forms of emotion expression. Most existing works focus on fusing multiple features over whole video clips. According to our observations, given a long video clip, the emotion is usually presented by only a few actions/objects in a few short snippets, and the meaningful cues are buried in the noisy background. When humans judge the emotion in videos, we first find the informative clips and then closely look for emotional cues in the frames. In this paper, we propose the Dual Focus Attention Network to mimic this process. First, three kinds of features, including action, object, and scene, are extracted from videos. Second, two attention modules are used to focus on the visual features of the videos from the temporal and spatial dimensions respectively. With our dual focus attention network, we can effectively discover the most emotional frames along the time dimension and the most emotional visual cues in each frame. Our experiments conducted on two widely used datasets, Ekman and VideoEmotion, show that our proposed approach outperforms the existing approaches.
AB - Video emotion recognition is a challenging task due to complex scenes and various forms of emotion expression. Most existing works focus on fusing multiple features over whole video clips. According to our observations, given a long video clip, the emotion is usually presented by only a few actions/objects in a few short snippets, and the meaningful cues are buried in the noisy background. When humans judge the emotion in videos, we first find the informative clips and then closely look for emotional cues in the frames. In this paper, we propose the Dual Focus Attention Network to mimic this process. First, three kinds of features, including action, object, and scene, are extracted from videos. Second, two attention modules are used to focus on the visual features of the videos from the temporal and spatial dimensions respectively. With our dual focus attention network, we can effectively discover the most emotional frames along the time dimension and the most emotional visual cues in each frame. Our experiments conducted on two widely used datasets, Ekman and VideoEmotion, show that our proposed approach outperforms the existing approaches.
KW - Attention for video
KW - Deep learning
KW - Video emotion recognition
UR - https://www.scopus.com/pages/publications/85090383358
U2 - 10.1109/ICME46284.2020.9102808
DO - 10.1109/ICME46284.2020.9102808
M3 - Conference contribution
AN - SCOPUS:85090383358
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2020 IEEE International Conference on Multimedia and Expo, ICME 2020
PB - IEEE Computer Society
T2 - 2020 IEEE International Conference on Multimedia and Expo, ICME 2020
Y2 - 6 July 2020 through 10 July 2020
ER -