TY - JOUR
T1 - Taming vision transformers for clinical laryngoscopy assessment
AU - Zhang, Xinzhu
AU - Zhao, Jing
AU - Zong, Daoming
AU - Ren, Henglei
AU - Gao, Chunli
N1 - Publisher Copyright: © 2024
PY - 2025/2
Y1 - 2025/2
N2 - Objective: Laryngoscopy, essential for diagnosing laryngeal cancer (LCA), faces challenges due to high inter-observer variability and the reliance on endoscopist expertise. Distinguishing precancerous from early-stage cancerous lesions is particularly challenging, even for experienced practitioners, given their similar appearances. This study aims to enhance laryngoscopic image analysis to improve early screening/detection of cancer or precancerous conditions. Methods: We propose MedFormer, a laryngeal cancer classification method based on the Vision Transformer (ViT). To address data scarcity, MedFormer employs a customized transfer learning approach that leverages the representational power of pre-trained transformers. This method enables robust out-of-domain generalization by fine-tuning a minimal set of additional parameters. Results: MedFormer achieves sensitivity and specificity of 98% and 89% for identifying precancerous lesions (leukoplakia) and 89% and 97% for detecting cancer, significantly surpassing its CNN counterparts. It also outperforms the two selected ViT-based models. Compared with physician visual evaluations (PVE), MedFormer matches PVE performance in all cases and exceeds it in certain scenarios. Visualizations using class activation maps (CAM) and deformable patches demonstrate MedFormer's interpretability, aiding clinicians in understanding the model's predictions. Conclusion: We highlight the potential of vision transformers in clinical laryngoscopic assessments, presenting MedFormer as an effective method for the early detection of laryngeal cancer.
KW - Deep learning
KW - Laryngeal cancer
KW - Medical image classification
KW - Transfer learning
KW - Transformer
UR - https://www.scopus.com/pages/publications/85215387819
U2 - 10.1016/j.jbi.2024.104766
DO - 10.1016/j.jbi.2024.104766
M3 - Article
C2 - 39827999
AN - SCOPUS:85215387819
SN - 1532-0464
VL - 162
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 104766
ER -