
LFG-LLM: A multimodal large language model-based audio–visual cough detection framework

  • Mingke Feng
  • Yingying Han*
  • Menghan Hu*
  • *Corresponding author for this work
  • University of Shanghai for Science and Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Cough detection in real-world environments faces significant challenges. Audio modalities are highly susceptible to environmental noise, while visual modalities suffer from the short duration and small motion amplitude of cough actions, which are also highly similar to behaviors such as sneezing and throat clearing, making stable and reliable recognition difficult. To address these challenges, this paper proposes a cough detection framework based on multimodal large language models. On the visual side, two-dimensional skeletal keypoint sequences are employed to model temporal motion features. At the semantic level, a hierarchical LLM architecture is introduced, where audio and skeletal features are separately analyzed to extract high-level soft features, including audio cough confidence and cross-modal temporal alignment. At the decision level, a rule-enhanced Learnable Fusion Gate (LFG) is designed to integrate noise awareness, alignment regulation, and lightweight nonlinear calibration into a unified probabilistic fusion framework, enabling adaptive multimodal modeling. Experiments on a self-collected dataset of 1024 real-world audio–visual samples demonstrate that the proposed method achieves stable and strong performance. In particular, the combination of Qwen3-Omni-Flash and LFG yields the best results, with an accuracy of 0.9229 and an F1-score of 0.9200, significantly outperforming unimodal baselines and the Direct-LLM scheme without explicit fusion.
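The abstract does not specify the internal form of the Learnable Fusion Gate, so the following is only a minimal sketch of one way a noise-aware, alignment-regulated probabilistic gate with a calibration step could look. The class name `FusionGate`, all parameter names, and the default weights are hypothetical illustrations, not the authors' implementation; the rule-based component mentioned in the abstract is omitted. The gate computes a weight for the audio branch that falls as estimated noise rises and grows with cross-modal alignment, mixes the two modalities in logit space, and applies an affine calibration before the final sigmoid.

```python
import math
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float, eps: float = 1e-6) -> float:
    """Inverse of sigmoid, with clamping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


@dataclass
class FusionGate:
    # All values below are hypothetical placeholders; in a trained
    # system they would be learned from data.
    w_noise: float = -4.0     # more noise -> trust audio less
    w_align: float = 2.0      # better alignment -> trust audio more
    bias: float = 1.0
    calib_scale: float = 1.0  # affine calibration of the fused logit
    calib_bias: float = 0.0

    def audio_weight(self, noise: float, alignment: float) -> float:
        """Gate value in (0, 1): the share given to the audio branch."""
        return sigmoid(self.w_noise * noise + self.w_align * alignment + self.bias)

    def fuse(self, p_audio: float, p_visual: float,
             noise: float, alignment: float) -> float:
        """Return a fused cough probability from two unimodal probabilities."""
        g = self.audio_weight(noise, alignment)
        z = g * logit(p_audio) + (1.0 - g) * logit(p_visual)
        return sigmoid(self.calib_scale * z + self.calib_bias)


if __name__ == "__main__":
    gate = FusionGate()
    # Same unimodal scores, evaluated under clean and noisy audio conditions.
    print(gate.fuse(0.9, 0.3, noise=0.0, alignment=1.0))
    print(gate.fuse(0.9, 0.3, noise=1.0, alignment=1.0))
```

With these placeholder weights, a confident audio score dominates the fused output when the audio is clean and is discounted in favor of the skeletal branch when the noise estimate is high. Mixing in logit space keeps the output a valid probability and leaves the result unchanged when both modalities agree.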

Original language: English
Article number: 103482
Journal: Displays
Volume: 94
State: Published - Sep 2026

Keywords

  • Audio–visual analysis
  • Cough detection
  • Multimodal fusion
  • Multimodal large language models
