CoughSlowFast: Cough Recognition With Audio and Video Signal Fusion

Mingke Feng, Guangtao Zhai, Xiao Ping Zhang, Menghan Hu

Research output: Contribution to journal › Article › peer-review

Abstract

The recognition of coughs plays a critical role in the diagnosis of respiratory diseases and the monitoring of public health. Traditional audio-based methods are highly susceptible to noise and lack spatial awareness, while visual methods struggle to recognize low-amplitude cough motions and are prone to confusion with other behaviors. To address these limitations, this letter proposes a multimodal cough recognition model, CoughSlowFast, which extends the SlowFast architecture by introducing a high-sampling-rate audio branch and designing a peak-aware masking mechanism to enhance the model's responsiveness to key frames. A temporal fusion strategy is employed to effectively integrate low-frequency structural motion, high-frequency dynamic variations, and transient audio features. Evaluated on a self-constructed multimodal cough dataset containing 9,254 synchronized audio–video samples, CoughSlowFast achieves an accuracy of 95.91% and an F1-score of 0.9148 under complex environmental conditions, significantly outperforming mainstream models including CSN, SlowFast, VideoSwin, Neural Cough Counter, and AVE, thus demonstrating strong potential for real-world deployment.
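
The abstract does not specify how the peak-aware masking mechanism is built, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes (hypothetically) that the key frame is the audio frame with the highest short-time energy and that frames are weighted by a Gaussian centered on that frame. The names `frame_energy` and `peak_aware_mask` and the parameters `width` and `floor` are invented for this example.

```python
import math


def frame_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def peak_aware_mask(energy, width=2.0, floor=0.1):
    """Per-frame weights that are largest at the highest-energy frame.

    Assumed form (not taken from the paper): a Gaussian bump of standard
    deviation `width` frames, lifted so that no weight drops below `floor`.
    """
    peak = max(range(len(energy)), key=energy.__getitem__)
    return [
        floor + (1.0 - floor) * math.exp(-0.5 * ((i - peak) / width) ** 2)
        for i in range(len(energy))
    ]


# Toy signal: 60 samples of silence with a short burst in the fourth frame.
samples = [0.0] * 60
for i in range(30, 40):
    samples[i] = 1.0

energy = frame_energy(samples, frame_len=10)
mask = peak_aware_mask(energy)

# Emphasize frames near the peak before they are passed to later stages.
weighted_energy = [m * e for m, e in zip(mask, energy)]
```

In this toy case the mask peaks at the burst frame and decays symmetrically on either side, so features from frames near the transient are emphasized while distant frames are attenuated but not discarded.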

Original language: English
Pages (from-to): 3774-3778
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 32
State: Published - 2025

Keywords

  • audio-visual signal processing
  • cough recognition
  • multimodal fusion
  • peak masking
