TY - JOUR
T1 - LFG-LLM
T2 - A multimodal large language model-based audio–visual cough detection framework
AU - Feng, Mingke
AU - Han, Yingying
AU - Hu, Menghan
N1 - Publisher Copyright: © 2026 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
PY - 2026/9
Y1 - 2026/9
N2 - Cough detection in real-world environments faces significant challenges. Audio modalities are highly susceptible to environmental noise, while visual modalities suffer from the short duration and small motion amplitude of cough actions, which are also highly similar to behaviors such as sneezing and throat clearing, making stable and reliable recognition difficult. To address these challenges, this paper proposes a cough detection framework based on multimodal large language models. On the visual side, two-dimensional skeletal keypoint sequences are employed to model temporal motion features. At the semantic level, a hierarchical LLM architecture is introduced, where audio and skeletal features are separately analyzed to extract high-level soft features, including audio cough confidence and cross-modal temporal alignment. At the decision level, a rule-enhanced Learnable Fusion Gate (LFG) is designed to integrate noise awareness, alignment regulation, and lightweight nonlinear calibration into a unified probabilistic fusion framework, enabling adaptive multimodal modeling. Experiments on a self-collected dataset of 1024 real-world audio–visual samples demonstrate that the proposed method achieves stable and strong performance. In particular, the combination of Qwen3-Omni-Flash and LFG yields the best results, with an accuracy of 0.9229 and an F1-score of 0.9200, significantly outperforming unimodal baselines and the Direct-LLM scheme without explicit fusion.
AB - Cough detection in real-world environments faces significant challenges. Audio modalities are highly susceptible to environmental noise, while visual modalities suffer from the short duration and small motion amplitude of cough actions, which are also highly similar to behaviors such as sneezing and throat clearing, making stable and reliable recognition difficult. To address these challenges, this paper proposes a cough detection framework based on multimodal large language models. On the visual side, two-dimensional skeletal keypoint sequences are employed to model temporal motion features. At the semantic level, a hierarchical LLM architecture is introduced, where audio and skeletal features are separately analyzed to extract high-level soft features, including audio cough confidence and cross-modal temporal alignment. At the decision level, a rule-enhanced Learnable Fusion Gate (LFG) is designed to integrate noise awareness, alignment regulation, and lightweight nonlinear calibration into a unified probabilistic fusion framework, enabling adaptive multimodal modeling. Experiments on a self-collected dataset of 1024 real-world audio–visual samples demonstrate that the proposed method achieves stable and strong performance. In particular, the combination of Qwen3-Omni-Flash and LFG yields the best results, with an accuracy of 0.9229 and an F1-score of 0.9200, significantly outperforming unimodal baselines and the Direct-LLM scheme without explicit fusion.
KW - Audio–visual analysis
KW - Cough detection
KW - Multimodal fusion
KW - Multimodal large language models
UR - https://www.scopus.com/pages/publications/105036270349
U2 - 10.1016/j.displa.2026.103482
DO - 10.1016/j.displa.2026.103482
M3 - Article
AN - SCOPUS:105036270349
SN - 0141-9382
VL - 94
JO - Displays
JF - Displays
M1 - 103482
ER -