YOLOSAM: A unified and efficient anomaly detection model based on auto mask prompt

  • Ruizhi Yu
  • Weiting Chen*
  • Jiahao Fan
  • Xiang Li
  • Zheming Fan
  • Qing Zhang

*Corresponding author for this work

East China Normal University

Research output: Contribution to journal › Article › peer-review

Abstract

Anomaly detection is a critical task in industrial manufacturing, and leveraging artificial intelligence to identify product anomalies is essential for significantly enhancing production efficiency. However, most existing approaches follow the one-to-one paradigm, where a customized model is trained for each category, incurring substantial computational and memory costs. Although some methods have emerged for universal anomaly detection in recent years, they usually require carefully designed text prompts or have slow inference speeds. Moreover, most anomaly detection methods lack the capability of fine-grained anomaly classification, necessitating additional training of classification models for practical applications with different categories. To address these challenges, we propose YOLOSAM, a unified and efficient anomaly detection model based on auto mask prompt. YOLOSAM is a dual-branch architecture that can handle multi-class few-shot anomaly detection with a unified model, including both segmentation and classification branches. In the segmentation branch, we design an auto mask prompt generator that generates mask prompts directly from visual information, eliminating the need for complex prompt engineering. In the detection branch, we design a defect detection head that utilizes the visual information to achieve fine-grained anomaly classification. Additionally, we employ knowledge distillation techniques to compress the image encoder, and both branches share this distilled encoder, effectively preserving SAM’s general knowledge while significantly enhancing the inference speed. YOLOSAM achieves anomaly classification and segmentation results of 95.6%/96.8% AUROC on the MVTec-AD dataset and 90.2%/97.6% AUROC on the VisA dataset under multi-class and 4-shot settings. The model achieves an inference speed of 46 ms per image, approximately 3 times faster than SOTA methods.
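
The abstract reports its results as AUROC for both classification (one score per image) and segmentation (one score per pixel). As a rough illustration of what that metric measures, the following minimal Python sketch computes AUROC as the fraction of (anomalous, normal) pairs in which the anomalous sample receives the higher score, with ties counted as half. This is not the authors' evaluation code: the function names `auroc` and `pixel_auroc`, the label convention (1 = anomalous, 0 = normal), and the nested-list mask format are all assumptions made for this example.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison (hypothetical helper, not from the paper).

    labels: iterable of 0 (normal) or 1 (anomalous).
    scores: iterable of anomaly scores, higher means more anomalous.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one normal and one anomalous sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomalous sample ranked above the normal one
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def pixel_auroc(gt_masks, score_maps):
    """Segmentation AUROC: flatten all pixels, then reuse auroc().

    gt_masks and score_maps are assumed to be lists of 2D nested lists
    with matching shapes (an illustrative format, not the paper's).
    """
    labels = [v for mask in gt_masks for row in mask for v in row]
    scores = [v for smap in score_maps for row in smap for v in row]
    return auroc(labels, scores)


if __name__ == "__main__":
    # Image-level example: four images, two of them anomalous.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    # Pixel-level example: one 2x2 mask with a single anomalous pixel.
    print(pixel_auroc([[[0, 0], [0, 1]]], [[[0.1, 0.2], [0.3, 0.9]]]))
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy illustration but too slow for full-resolution pixel maps; a rank-based formulation or a library routine would normally be used there.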

Original language: English
Article number: 1055
Journal: Signal, Image and Video Processing
Volume: 19
Issue number: 12
State: Published - Dec 2025

Keywords

  • Automatic prompt
  • Knowledge distill
  • Lightweight model
  • SAM
  • Unified anomaly detection
