Abstract
Contaminant property data sets are typically small, posing challenges for developing accurate deep learning (DL) models. In this study, we pretrained ResNet18 models on the PubChem data set (∼10 million molecules) using molecular RGB images as inputs and their MACCS fingerprints as labels, generating five models (Chemage1 to Chemage5) with varying pretraining accuracies, and fine-tuned them on 10 MoleculeNet and 12 contaminant property data sets. We found that appropriate model architectures and fine-tuning techniques significantly improve the transfer learning efficacy. We then developed an ensemble model, Ens-Chemage, to combine the strengths of these individual models. Ens-Chemage outperformed conventional machine learning (ML) models and ImageMol on almost all tested data sets. Through model interpretation, we found that Ens-Chemage learned more accurate and distinct features than the other models. Additionally, we defined its applicability domain (AD) by using an uncertainty-based approach. Finally, Ens-Chemage has been deployed for free public use at https://ens-chemage.streamlit.app/. This study presents significant advancements in the application of DL for small contaminant property data sets.
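The abstract does not say how the five models are combined or how the uncertainty behind the applicability domain is computed. As a rough illustration only, the sketch below assumes the ensemble prediction is the mean of the member predictions and that uncertainty is the spread (population standard deviation) among members. The function name `ensemble_predict`, the threshold `max_std`, and the example numbers are all hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def ensemble_predict(member_predictions, max_std=0.15):
    """Combine per-model predictions for a single molecule.

    member_predictions: one float per fine-tuned model (five in the abstract).
    max_std: hypothetical uncertainty cutoff for the applicability domain;
             the study's actual criterion is not given in the abstract.
    """
    if not member_predictions:
        raise ValueError("need at least one member prediction")
    estimate = mean(member_predictions)        # assumed combination rule: simple average
    uncertainty = pstdev(member_predictions)   # assumed uncertainty: member disagreement
    return {
        "prediction": estimate,
        "uncertainty": uncertainty,
        "in_domain": uncertainty <= max_std,   # inside the AD when members agree
    }


# Made-up member outputs for two molecules, for illustration only.
agreeing = [0.81, 0.79, 0.83, 0.80, 0.82]
disagreeing = [0.10, 0.90, 0.45, 0.70, 0.25]

result_agreeing = ensemble_predict(agreeing)
result_disagreeing = ensemble_predict(disagreeing)
```

Under these assumptions, a molecule on which the members closely agree is treated as inside the applicability domain, while one with widely scattered member predictions is flagged as outside it.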
| Original language | English |
|---|---|
| Pages (from-to) | 1200-1206 |
| Number of pages | 7 |
| Journal | Environmental Science and Technology Letters |
| Volume | 11 |
| Issue number | 11 |
| State | Published - 12 Nov 2024 |
Keywords
- Deep learning
- Ensemble learning
- Molecular image
- Molecular property prediction
- Transfer learning