Developing Quantitative Structure-Activity Relationship (QSAR) Models for Water Contaminants’ Activities/Properties by Fine-Tuning GPT-3 Models

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

In this study, we developed quantitative structure-activity relationship (QSAR) models for water contaminants’ activities/properties by fine-tuning GPT-3 models. We also proposed a novel masked atom importance (MAI) approach for model interpretation and an OpenAIEmbedding similarity-based method for determining the applicability domain. We utilized the Simplified Molecular-Input Line-Entry System (SMILES) of contaminants and their corresponding activities/properties from three data sets: pKd, Koc, and Solubility. These were used as input prompts and completions, respectively, to fine-tune four GPT-3 models (Davinci, Curie, Babbage, and Ada) obtained from OpenAI. The Babbage model demonstrated superior performance for the pKd data set, while the Davinci model excelled with the Koc and Solubility data sets, even outperforming molecular fingerprint (MF) CatBoost-based QSAR models. The MAI interpretation results were qualitatively consistent with the SHapley Additive exPlanations (SHAP) interpretation but exhibited less sensitivity in quantitative analysis. The OpenAIEmbedding similarity-based applicability domain determination approach showed efficacy comparable to that of the MF-based similarity approach but with added robustness. This study underscores the potential of large language models in developing QSAR models, paving the way for further advancements in QSAR modeling using state-of-the-art language models.
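
The abstract describes two mechanics that a short sketch may make more concrete: pairing each SMILES string (prompt) with its property value (completion) for fine-tuning, and using embedding similarity to decide whether a query compound falls inside the applicability domain. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The separator strings, the two-decimal number formatting, the 0.8 threshold, the nearest-neighbor rule, and all function names are hypothetical; the embedding vectors are taken as already computed, and the property values in the demo are placeholders rather than values from the three data sets.

```python
import json
import math

# Assumed separators (not taken from the paper): a fixed marker ending each
# prompt and a newline ending each completion.
PROMPT_END = " ->"
COMPLETION_END = "\n"


def to_record(smiles: str, value: float, decimals: int = 2) -> dict:
    """Turn one (SMILES, property value) pair into a prompt/completion record."""
    return {
        "prompt": f"{smiles}{PROMPT_END}",
        "completion": f" {value:.{decimals}f}{COMPLETION_END}",
    }


def to_jsonl(pairs) -> str:
    """Serialize (SMILES, value) pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(to_record(s, v)) for s, v in pairs)


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def in_domain(query_vec, train_vecs, threshold: float = 0.8) -> bool:
    """Treat a query as inside the applicability domain when its most similar
    training embedding reaches the threshold (assumed rule and value)."""
    return max(cosine(query_vec, t) for t in train_vecs) >= threshold


if __name__ == "__main__":
    # Placeholder property values, for illustration only.
    print(to_jsonl([("CCO", 1.10), ("c1ccccc1", -1.64)]))
    # Toy 2-D vectors standing in for precomputed embeddings.
    print(in_domain([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]]))
```

A max-similarity rule is only one way to set the boundary; averaging over the k nearest training compounds would be an equally plausible choice, and the abstract does not say which variant was used.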

Original language: English
Pages (from-to): 872-877
Number of pages: 6
Journal: Environmental Science and Technology Letters
Volume: 10
Issue number: 10
State: Published - 10 Oct 2023

Keywords

  • Fine-tuning
  • GPT-3
  • Machine learning
  • QSAR
  • Water contaminants
