TY - JOUR
T1 - RPSubAlign
T2 - a novel sequence-based molecular representation method for retrosynthesis prediction with improved validity and robustness
AU - Hu, Yuting
AU - Hu, Feng
AU - Zhang, Hongwen
AU - Xu, Hongling
AU - Gao, Jixiang
AU - Deng, Wenshuai
AU - Tian, Zijing
AU - Hu, Qiaoyu
AU - Li, Honglin
AU - Diao, Yanyan
N1 - Publisher Copyright: © 2025 The Author(s).
PY - 2025/5/1
Y1 - 2025/5/1
AB - Retrosynthetic route planning is essential for designing efficient pathways to synthesize complex molecules, serving as a cornerstone of drug discovery and organic synthesis. Sequence-based models have become a predominant approach to retrosynthetic route planning, yet their validity and robustness remain limited by challenges in molecular representation. Current methods typically treat reactants and products as independent molecules, overlooking structural relationships crucial for accurate synthesis predictions. Herein, we introduce RPSubAlign, a molecular sequence representation method specifically tailored for retrosynthetic tasks, which aligns common substructures between reactants and products to enhance the validity and robustness of sequence-based models. Compared with conventional random and root-alignment representations, RPSubAlign achieves better performance on the USPTO-50K and USPTO-MIT datasets, with up to a 34.8% increase in Top-N accuracy (with the Self-Referencing Embedded Strings representation), and demonstrates enhanced stability across various data augmentation scenarios. RPSubAlign also significantly improves syntactic validity, reaching 86.64% on USPTO-50K and 96.45% on USPTO-MIT (with the Simplified Molecular Input Line Entry System representation), outperforming baseline methods. These results highlight RPSubAlign as a robust, effective molecular characterization method for retrosynthesis prediction.
KW - computer-aided synthesis planning
KW - deep learning
KW - molecular representation
UR - https://www.scopus.com/pages/publications/105008133231
U2 - 10.1093/bib/bbaf257
DO - 10.1093/bib/bbaf257
M3 - Article
AN - SCOPUS:105008133231
SN - 1467-5463
VL - 26
JO - Briefings in Bioinformatics
JF - Briefings in Bioinformatics
IS - 3
M1 - bbaf257
ER -