Identifying subjective life expectancy risk factors in physically active and inactive middle-aged and older adults using machine learning models

Jian Yang, Zhihui Li*, Ming Wu, Yuan Zhang, Bianjiang Zhang, Huiyu Shi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Physical activity is a key public health focus, and subjective life expectancy is closely associated with individuals’ physical and psychological well-being. This study aimed to identify risk factors for subjective life expectancy among physically active and physically inactive middle-aged and older adults, and to provide an evidence base for differentiated health intervention strategies. Methods: Data came from the 2018 survey of the China Health and Retirement Longitudinal Study (CHARLS); a total of 10,945 participants were included. Five machine learning models, namely Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were constructed separately for the active and inactive groups. To reduce bias caused by class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to generate synthetic samples for the minority class. The dataset was split into a training set (70%) and a testing set (30%), and ten-fold cross-validation combined with grid search was used to optimize hyperparameters, with the aim of improving the robustness and generalizability of the models. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and F1-score. Results: The active group (4,707 men and 4,885 women) had a mean age of 59.76 years, and the inactive group (662 men and 691 women) had a mean age of 63.00 years. The SVM model performed best in the inactive group (AUC: 0.797; accuracy: 0.722; sensitivity: 0.747), whereas the LightGBM model performed best in the active group (AUC: 0.775; accuracy: 0.745; specificity: 0.814). Feature importance analysis indicated that “age” was the most important variable in the SVM model, while “perceived health” was the most important variable in the LightGBM model. Conclusion: Machine learning methods can identify key risk factors for subjective life expectancy among middle-aged and older adults, and can inform targeted health management strategies tailored to populations with different levels of physical activity.
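The evaluation metrics named in the Methods (AUC, accuracy, sensitivity, specificity, F1-score) can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names, the 0.5 decision threshold, and the eight toy labels and scores are assumptions made purely for illustration, and the numbers are unrelated to the CHARLS results reported above.

```python
from __future__ import annotations

from typing import Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Accuracy, sensitivity, specificity and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, for illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

AUC is computed from the predicted scores and does not depend on the threshold, whereas accuracy, sensitivity, specificity, and F1 all change with the chosen cut-off. This is one reason a model can rank first on AUC without leading on every thresholded metric.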

Original language: English
Article number: 3506
Journal: BMC Public Health
Volume: 25
Issue number: 1
State: Published - Dec 2025

Keywords

  • Machine learning
  • Middle-aged and elderly people
  • Physical activity
  • Subjective life expectancy
