Taming vision transformers for clinical laryngoscopy assessment

Xinzhu Zhang, Jing Zhao, Daoming Zong, Henglei Ren, Chunli Gao

Research output: Contribution to journal › Article › peer-review

1 Scopus citations

Abstract

Objective: Laryngoscopy, essential for diagnosing laryngeal cancer (LCA), faces challenges due to high inter-observer variability and reliance on endoscopist expertise. Distinguishing precancerous from early-stage cancerous lesions is particularly difficult, even for experienced practitioners, because the two look similar. This study aims to enhance laryngoscopic image analysis to improve early screening and detection of cancerous and precancerous conditions.

Methods: We propose MedFormer, a laryngeal cancer classification method based on the Vision Transformer (ViT). To address data scarcity, MedFormer employs a customized transfer learning approach that leverages the representational power of pre-trained transformers. This method enables robust out-of-domain generalization by fine-tuning a minimal set of additional parameters.

Results: MedFormer achieves sensitivity and specificity of 98% and 89% for identifying precancerous lesions (leukoplakia) and 89% and 97% for detecting cancer, significantly surpassing its CNN counterparts. It also outperforms the two ViT-based models selected for comparison. Relative to physician visual evaluations (PVE), MedFormer performs at least on par in all cases and better in certain scenarios. Visualizations using class activation maps (CAM) and deformable patches demonstrate MedFormer's interpretability, aiding clinicians in understanding the model's predictions.

Conclusion: We highlight the potential of vision transformers in clinical laryngoscopic assessments, presenting MedFormer as an effective method for the early detection of laryngeal cancer.
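
The sensitivity and specificity figures in the Results are reported per lesion class. As a minimal sketch of how such one-vs-rest values are commonly computed from predicted labels (this is not the authors' evaluation code; the function name, the class labels, and the toy data below are illustrative assumptions):

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[float, float]:
    """One-vs-rest sensitivity and specificity for the class `positive`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive case correctly flagged
            else:
                fn += 1  # positive case missed
        else:
            if pred == positive:
                fp += 1  # other class wrongly flagged as positive
            else:
                tn += 1  # other class correctly not flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Toy labels (made up for this sketch, not study data).
y_true = ["leukoplakia", "leukoplakia", "cancer", "normal", "cancer", "normal"]
y_pred = ["leukoplakia", "cancer", "cancer", "normal", "cancer", "leukoplakia"]

leuk_sens, leuk_spec = sensitivity_specificity(y_true, y_pred, "leukoplakia")
cancer_sens, cancer_spec = sensitivity_specificity(y_true, y_pred, "cancer")
```

With more than two classes, each class is treated in turn as the positive class and all remaining classes as negative, which is why the abstract gives a separate pair of values for leukoplakia and for cancer.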

Original language: English
Article number: 104766
Journal: Journal of Biomedical Informatics
Volume: 162
State: Published - Feb 2025

Keywords

  • Deep learning
  • Laryngeal cancer
  • Medical image classification
  • Transfer learning
  • Transformer
