Abstract
Cough detection in real-world environments faces significant challenges. Audio modalities are highly susceptible to environmental noise, while visual modalities suffer from the short duration and small motion amplitude of cough actions, which are also highly similar to behaviors such as sneezing and throat clearing, making stable and reliable recognition difficult. To address these challenges, this paper proposes a cough detection framework based on multimodal large language models. On the visual side, two-dimensional skeletal keypoint sequences are employed to model temporal motion features. At the semantic level, a hierarchical LLM architecture is introduced, where audio and skeletal features are separately analyzed to extract high-level soft features, including audio cough confidence and cross-modal temporal alignment. At the decision level, a rule-enhanced Learnable Fusion Gate (LFG) is designed to integrate noise awareness, alignment regulation, and lightweight nonlinear calibration into a unified probabilistic fusion framework, enabling adaptive multimodal modeling. Experiments on a self-collected dataset of 1024 real-world audio–visual samples demonstrate that the proposed method achieves stable and strong performance. In particular, the combination of Qwen3-Omni-Flash and LFG yields the best results, with an accuracy of 0.9229 and an F1-score of 0.9200, significantly outperforming unimodal baselines and the Direct-LLM scheme without explicit fusion.
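The abstract describes the decision-level fusion only in outline, so the following is a minimal sketch of what a rule-enhanced, gated probabilistic fusion could look like, not the paper's actual LFG formulation. Every name and parameter here (`FusionGate`, `noise_level`, `alignment`, the weights, and the noise cap rule) is a hypothetical assumption chosen for illustration. In a real system the weights would be learned from data rather than fixed by hand.

```python
import math
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float, eps: float = 1e-6) -> float:
    """Inverse of sigmoid, with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


@dataclass
class FusionGate:
    """Hypothetical gate; all defaults are illustrative, not from the paper."""
    w_noise: float = -3.0       # more noise -> less trust in audio
    w_align: float = 2.0        # better audio-visual alignment -> more trust in audio
    bias: float = 0.5
    noise_cap: float = 0.8      # rule: above this noise level, cap the audio weight
    max_noisy_weight: float = 0.2
    scale: float = 1.0          # lightweight calibration of the fused logit
    shift: float = 0.0

    def audio_weight(self, noise_level: float, alignment: float) -> float:
        """Weight in [0, 1] given to the audio branch."""
        g = sigmoid(self.bias + self.w_noise * noise_level + self.w_align * alignment)
        if noise_level >= self.noise_cap:
            # Hand-written rule layered on top of the learnable gate.
            g = min(g, self.max_noisy_weight)
        return g

    def fuse(self, p_audio: float, p_visual: float,
             noise_level: float, alignment: float) -> float:
        """Blend the two cough probabilities in logit space, then calibrate."""
        g = self.audio_weight(noise_level, alignment)
        z = g * logit(p_audio) + (1.0 - g) * logit(p_visual)
        return sigmoid(self.scale * z + self.shift)


gate = FusionGate()
# Same unimodal confidences, evaluated under clean and noisy audio conditions.
clean = gate.fuse(p_audio=0.9, p_visual=0.6, noise_level=0.1, alignment=0.9)
noisy = gate.fuse(p_audio=0.9, p_visual=0.6, noise_level=0.9, alignment=0.9)
print(f"clean audio: {clean:.3f}, noisy audio: {noisy:.3f}")
```

In this sketch, a confident audio score contributes less to the fused probability when the estimated noise level is high, so the result moves toward the skeleton-based score. Blending in logit space rather than probability space is one common design choice, and the source does not say which the authors use.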
| Original language | English |
|---|---|
| Article number | 103482 |
| Journal | Displays |
| Volume | 94 |
| DOIs | |
| State | Published - Sep 2026 |
Keywords
- Audio–visual analysis
- Cough detection
- Multimodal fusion
- Multimodal large language models
Title
LFG-LLM: A multimodal large language model-based audio–visual cough detection framework