TY - JOUR
T1 - CoughSlowFast
T2 - Cough Recognition With Audio and Video Signal Fusion
AU - Feng, Mingke
AU - Zhai, Guangtao
AU - Zhang, Xiao Ping
AU - Hu, Menghan
N1 - Publisher Copyright: © 2025 IEEE. All rights reserved.
PY - 2025
Y1 - 2025
N2 - The recognition of coughs plays a critical role in the diagnosis of respiratory diseases and the monitoring of public health. Traditional audio-based methods are highly susceptible to noise and lack spatial awareness, while visual methods struggle to recognize low-amplitude cough motions and are prone to confusion with other behaviors. To address these limitations, this letter proposes a multimodal cough recognition model, CoughSlowFast, which extends the SlowFast architecture by introducing a high-sampling-rate audio branch and designing a peak-aware masking mechanism to enhance the model's responsiveness to key frames. A temporal fusion strategy is employed to effectively integrate low-frequency structural motion, high-frequency dynamic variations, and transient audio features. Evaluated on a self-constructed multimodal cough dataset containing 9,254 synchronized audio–video samples, CoughSlowFast achieves an accuracy of 95.91% and an F1-score of 0.9148 under complex environmental conditions, significantly outperforming mainstream models including CSN, SlowFast, VideoSwin, Neural Cough Counter, and AVE, thus demonstrating strong potential for real-world deployment.
AB - The recognition of coughs plays a critical role in the diagnosis of respiratory diseases and the monitoring of public health. Traditional audio-based methods are highly susceptible to noise and lack spatial awareness, while visual methods struggle to recognize low-amplitude cough motions and are prone to confusion with other behaviors. To address these limitations, this letter proposes a multimodal cough recognition model, CoughSlowFast, which extends the SlowFast architecture by introducing a high-sampling-rate audio branch and designing a peak-aware masking mechanism to enhance the model's responsiveness to key frames. A temporal fusion strategy is employed to effectively integrate low-frequency structural motion, high-frequency dynamic variations, and transient audio features. Evaluated on a self-constructed multimodal cough dataset containing 9,254 synchronized audio–video samples, CoughSlowFast achieves an accuracy of 95.91% and an F1-score of 0.9148 under complex environmental conditions, significantly outperforming mainstream models including CSN, SlowFast, VideoSwin, Neural Cough Counter, and AVE, thus demonstrating strong potential for real-world deployment.
KW - audio-visual signal processing
KW - Cough recognition
KW - multimodal fusion
KW - peak masking
UR - https://www.scopus.com/pages/publications/105017403052
U2 - 10.1109/LSP.2025.3612351
DO - 10.1109/LSP.2025.3612351
M3 - Article
AN - SCOPUS:105017403052
SN - 1070-9908
VL - 32
SP - 3774
EP - 3778
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -