Enlarging Applicability Domain of Quantitative Structure-Activity Relationship Models through Uncertainty-Based Active Learning

Shifa Zhong, Dylan R. Lambeth, Thomas K. Igou, Yongsheng Chen

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

The first step in developing a quantitative structure-activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from published studies or measured experimentally. A key challenge in this process is determining which chemicals to use to train a QSAR model and, of those chemicals, which should be prioritized in experimental trials to ensure that the resulting models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k•OH, each with a number of chemicals represented by molecular descriptors; the GP provides an uncertainty estimate (as a standard deviation) for each prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies the chemicals with the highest prediction uncertainty and adds them to the training set. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models generalize better than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that, for all AD methods, the AD of AC-based QSAR models is significantly larger than that of RS-based models (up to 24 times larger). This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the reliability of property predictions for more REACH dataset chemicals while minimizing development cost and time.
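The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the RBF kernel with a fixed length scale, the noise term, and the toy one-dimensional descriptor pool are all assumptions made for illustration. Because the kernel hyperparameters are held fixed here, the GP posterior standard deviation depends only on the descriptor values, so no measured property values are needed to rank candidates. A real workflow would typically refit the GP after each round, for example with a library such as scikit-learn's GaussianProcessRegressor.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two descriptor vectors (assumed kernel)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def gp_std(X_train, x, noise=1e-6):
    """Posterior standard deviation of a zero-mean GP at descriptor vector x."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X_train)]
         for i, a in enumerate(X_train)]
    L = cholesky(K)
    k_star = [rbf(a, x) for a in X_train]
    v = forward_solve(L, k_star)
    # var = k(x, x) - k_*^T K^{-1} k_*, computed via the Cholesky factor
    var = rbf(x, x) - sum(t * t for t in v)
    return math.sqrt(max(var, 0.0))

def active_select(pool, n_select, seed_index=0):
    """Greedily add the pool chemical with the highest predictive uncertainty."""
    chosen = [seed_index]
    while len(chosen) < n_select:
        train = [pool[i] for i in chosen]
        candidates = [i for i in range(len(pool)) if i not in chosen]
        best = max(candidates, key=lambda i: gp_std(train, pool[i]))
        chosen.append(best)
    return chosen

# Hypothetical pool: four similar chemicals and two structurally distant ones.
pool = [[0.0], [0.1], [0.2], [0.3], [5.0], [10.0]]
print(active_select(pool, 3))
```

On this toy pool, the loop picks the two isolated descriptor values before any of the near-duplicates of the seed, which mirrors the abstract's observation that uncertainty-based selection yields a more diverse training set than random splitting.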

Original language: English
Pages (from-to): 1211-1220
Number of pages: 10
Journal: ACS ES and T Engineering
Volume: 2
Issue number: 7
State: Published - 8 Jul 2022
Externally published: Yes

Keywords

  • Gaussian process
  • QSAR
  • active learning
  • applicability domain
  • uncertainty
