
LFG-LLM: A multimodal large language model-based audio–visual cough detection framework

  • Mingke Feng
  • Yingying Han*
  • Menghan Hu*
  • *Corresponding author for this work
  • University of Shanghai for Science and Technology

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Cough detection in real-world environments faces significant challenges. Audio modalities are highly susceptible to environmental noise, while visual modalities suffer from the short duration and small motion amplitude of cough actions, which are also highly similar to behaviors such as sneezing and throat clearing, making stable and reliable recognition difficult. To address these challenges, this paper proposes a cough detection framework based on multimodal large language models. On the visual side, two-dimensional skeletal keypoint sequences are employed to model temporal motion features. At the semantic level, a hierarchical LLM architecture is introduced, where audio and skeletal features are separately analyzed to extract high-level soft features, including audio cough confidence and cross-modal temporal alignment. At the decision level, a rule-enhanced Learnable Fusion Gate (LFG) is designed to integrate noise awareness, alignment regulation, and lightweight nonlinear calibration into a unified probabilistic fusion framework, enabling adaptive multimodal modeling. Experiments on a self-collected dataset of 1024 real-world audio–visual samples demonstrate that the proposed method achieves stable and strong performance. In particular, the combination of Qwen3-Omni-Flash and LFG yields the best results, with an accuracy of 0.9229 and an F1-score of 0.9200, significantly outperforming unimodal baselines and the Direct-LLM scheme without explicit fusion.
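The abstract gives no formulas for the Learnable Fusion Gate, so the following is only a minimal illustrative sketch of what a noise-aware, alignment-regulated probabilistic fusion with a light calibration step could look like. Every name and parameter here (`fuse`, `w_noise`, `w_align`, `bias`, `scale`, `shift`) and the logit-space weighting are assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float, eps: float = 1e-6) -> float:
    # Clamp to avoid infinities at p = 0 or p = 1.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fuse(
    p_audio: float,      # audio cough confidence in [0, 1]
    p_visual: float,     # skeleton-based cough confidence in [0, 1]
    noise: float,        # hypothetical noise estimate in [0, 1]
    alignment: float,    # hypothetical cross-modal alignment score in [0, 1]
    w_noise: float = -4.0,   # assumed parameter: more noise -> trust audio less
    w_align: float = 2.0,    # assumed parameter: better alignment -> trust audio more
    bias: float = 1.0,
    scale: float = 1.0,      # assumed calibration slope
    shift: float = 0.0,      # assumed calibration offset
) -> float:
    # Gate in (0, 1): the weight given to the audio branch.
    gate = sigmoid(w_noise * noise + w_align * alignment + bias)
    # Convex combination of the two branches in logit space.
    fused = gate * logit(p_audio) + (1.0 - gate) * logit(p_visual)
    # Lightweight calibration: affine map followed by a sigmoid.
    return sigmoid(scale * fused + shift)


if __name__ == "__main__":
    clean = fuse(0.9, 0.2, noise=0.0, alignment=1.0)
    noisy = fuse(0.9, 0.2, noise=1.0, alignment=0.0)
    print(f"clean audio: {clean:.3f}, noisy audio: {noisy:.3f}")
```

In this sketch the gate parameters are fixed constants; in a learnable version they would be fitted on labeled data, for example by minimizing binary cross-entropy on the fused probability.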

Original language: English
Article number: 103482
Journal: Displays
Volume: 94
DOI
Publication status: Published - Sep 2026
