RPSubAlign: a novel sequence-based molecular representation method for retrosynthesis prediction with improved validity and robustness

  • Yuting Hu
  • Feng Hu
  • Hongwen Zhang
  • Hongling Xu
  • Jixiang Gao
  • Wenshuai Deng
  • Zijing Tian
  • Qiaoyu Hu
  • Honglin Li*
  • Yanyan Diao*

*Corresponding author for this work

Research output: Contribution to journal > Article > peer-review

Abstract

Retrosynthetic route planning is essential for designing efficient pathways to synthesize complex molecules, serving as a cornerstone in drug discovery and organic synthesis. Sequence-based models have become a predominant approach in retrosynthetic route planning, yet their validity and robustness remain limited by challenges in molecular representation. Current methods typically treat reactants and products as independent molecules, overlooking structural relationships crucial for accurate synthesis predictions. Herein, we introduce RPSubAlign, a molecular sequence representation method specifically tailored for retrosynthetic tasks, which aligns common substructures between reactants and products to enhance the validity and robustness of sequence-based models. Compared with conventional random and root-alignment representations, RPSubAlign achieves better performance on the USPTO-50K and USPTO-MIT datasets, with up to a 34.8% increase in Top-N accuracy (with the Self-Referencing Embedded Strings representation) and enhanced stability across various data augmentation scenarios. RPSubAlign also significantly improves syntactic validity, reaching 86.64% on USPTO-50K and 96.45% on USPTO-MIT (with the Simplified Molecular Input Line Entry System representation), outperforming baseline methods. These results highlight RPSubAlign as a robust and effective molecular characterization method for retrosynthesis prediction.
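The two metrics reported in the abstract, Top-N accuracy and syntactic validity, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the toy data, and the bracket-balance check are assumptions made for illustration. In particular, `balanced_brackets` is only a crude stand-in for syntactic validity; a real evaluation would parse each predicted string with a cheminformatics toolkit and would typically canonicalize strings before comparing them.

```python
from typing import Callable, Sequence


def top_n_accuracy(predictions: Sequence[Sequence[str]],
                   targets: Sequence[str], n: int) -> float:
    """Fraction of examples whose target is among the first n ranked predictions.

    Assumes predictions and targets are already in a comparable (canonical) form,
    since matching here is plain string equality.
    """
    if not targets:
        return 0.0
    hits = sum(1 for preds, target in zip(predictions, targets)
               if target in preds[:n])
    return hits / len(targets)


def balanced_brackets(s: str) -> bool:
    """Crude, hypothetical validity proxy: checks only that () and [] nest properly."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def validity_rate(predictions: Sequence[Sequence[str]],
                  is_valid: Callable[[str], bool] = balanced_brackets) -> float:
    """Fraction of all predicted strings accepted by the validity check."""
    flat = [p for preds in predictions for p in preds]
    if not flat:
        return 0.0
    return sum(1 for p in flat if is_valid(p)) / len(flat)


# Toy data (invented for illustration): two products, two ranked predictions each.
example_predictions = [
    ["CCO.CC(=O)O", "CCO"],
    ["c1ccccc1(Br", "c1ccccc1Br"],  # first candidate has an unclosed parenthesis
]
example_targets = ["CCO.CC(=O)O", "c1ccccc1Br"]

top1 = top_n_accuracy(example_predictions, example_targets, n=1)
top2 = top_n_accuracy(example_predictions, example_targets, n=2)
valid = validity_rate(example_predictions)
```

In this toy setting the second target is only recovered at rank 2, so Top-2 accuracy is higher than Top-1, and one of the four predicted strings fails the validity proxy.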

Original language: English
Article number: bbaf257
Journal: Briefings in Bioinformatics
Volume: 26
Issue number: 3
State: Published - 1 May 2025

Keywords

  • computer-aided synthesis planning
  • deep learning
  • molecular representation
