TY - JOUR
T1 - Global Representation Guided Adaptive Fusion Network for Stable Video Crowd Counting
AU - Cai, Yiqing
AU - Ma, Zhenwei
AU - Lu, Changhong
AU - Wang, Changbo
AU - He, Gaoqi
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2023
Y1 - 2023
N2 - Modern crowd counting methods in natural scenes, even when video datasets are available, are mostly based on single images. Because of background interference and occlusion in the scene, these methods can easily produce abrupt changes and instability in density prediction. There has been minimal research on how to exploit the inherent consistency among adjacent frames to achieve high estimation accuracy on video sequences. In this study, we explore the long-term global temporal consistency in video sequences and propose a novel Global Representation Guided Adaptive Fusion Network (GRGAF) for video crowd counting. The primary aim is to establish a long-term temporal representation among consecutive frames to guide the density estimation of local frames, which alleviates the prediction instability caused by background noise and occlusions in crowd scenes. Moreover, to further enforce temporal consistency, we apply a generative adversarial learning scheme and design a global-local joint loss, which makes the estimated density maps more temporally coherent. Extensive experiments on four challenging video-based crowd counting datasets (FDST, DroneCrowd, MALL and UCSD) demonstrate that our method makes effective use of the spatio-temporal information in videos and outperforms other state-of-the-art approaches.
AB - Modern crowd counting methods in natural scenes, even when video datasets are available, are mostly based on single images. Because of background interference and occlusion in the scene, these methods can easily produce abrupt changes and instability in density prediction. There has been minimal research on how to exploit the inherent consistency among adjacent frames to achieve high estimation accuracy on video sequences. In this study, we explore the long-term global temporal consistency in video sequences and propose a novel Global Representation Guided Adaptive Fusion Network (GRGAF) for video crowd counting. The primary aim is to establish a long-term temporal representation among consecutive frames to guide the density estimation of local frames, which alleviates the prediction instability caused by background noise and occlusions in crowd scenes. Moreover, to further enforce temporal consistency, we apply a generative adversarial learning scheme and design a global-local joint loss, which makes the estimated density maps more temporally coherent. Extensive experiments on four challenging video-based crowd counting datasets (FDST, DroneCrowd, MALL and UCSD) demonstrate that our method makes effective use of the spatio-temporal information in videos and outperforms other state-of-the-art approaches.
KW - Adaptive fusion
KW - crowd counting
KW - global temporal representation
KW - spatio-temporal consistency
KW - video understanding
UR - https://www.scopus.com/pages/publications/85134229784
U2 - 10.1109/TMM.2022.3189246
DO - 10.1109/TMM.2022.3189246
M3 - Article
AN - SCOPUS:85134229784
SN - 1520-9210
VL - 25
SP - 5222
EP - 5233
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -