Prophet: SSD Failure Analysis and Prediction Guided by Flash Reliability Characteristics in Data Centers

  • Yunpeng Song
  • Yujiong Liang
  • Jialin Liu
  • Liang Shi*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Solid-state drives (SSDs) are widely deployed in many fields, especially in data centers, because of their cost-effectiveness. However, SSDs may fail due to imperfect manufacturing processes, causing system-level failures and even data center downtime, which makes SSD failure prediction critical. Current studies focus on handling missing data, numerical normalization, and other statistical issues that arise when applying machine learning methods, but they overlook the reliability characteristics of the underlying flash media and the timeliness of the prediction result, that is, the time between the predicted failure and the real failure. In this work, we study the failure characteristics of over 200,000 drives from industry data centers over a 4-year period, along with their daily data. We first investigate the relationship between SSD attribute values and failures. We then analyze SSD failure characteristics from several aspects (causes, differences between failures, and timeliness of prediction results) based on flash reliability characteristics. Building on these findings, we propose a novel SSD failure prediction method, Prophet, which has two components. First, to cope with the differences between failures, a diff-state method models SSDs in different “States” separately. The “State” of an SSD represents the range of values in which the SSD currently lies for some key attributes. Using flash reliability characteristics, we distinguish between different failures before training the model, so that different failure behaviors can be predicted accurately. Second, a recovery period method enhances the timeliness of the prediction result through the design of the sample selection method. Operations personnel can use the enhanced timeliness to handle failed SSDs, for example by replacing or repairing them. Evaluation on a real dataset shows that Prophet substantially improves predictive ability, achieving high recall and a low false-positive rate while providing sufficient response time for handling failed SSDs.
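
The two components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give these details: the attribute name `wear_pct`, the range boundaries in `WEAR_EDGES`, and the `recovery_days` and `window_days` parameters are all hypothetical. The sketch assumes that a "State" can be approximated by binning one key attribute into value ranges, and that sample selection for timeliness can be approximated by marking as positive only those daily samples taken early enough before the failure to leave time for a response.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# Hypothetical range boundaries for one key attribute (wear, in percent).
# These thresholds are assumptions for illustration, not values from the paper.
WEAR_EDGES = [30, 70]


def assign_state(wear_pct, edges=WEAR_EDGES):
    """Return the index of the value range that contains wear_pct."""
    return bisect_right(edges, wear_pct)


def split_by_state(samples):
    """Group samples by state so that each group can get its own model."""
    groups = defaultdict(list)
    for sample in samples:
        groups[assign_state(sample["wear_pct"])].append(sample)
    return dict(groups)


def label_sample(sample_day, failure_day, recovery_days=7, window_days=14):
    """Label one daily sample under an assumed recovery-period rule.

    Returns 1 if the sample lies between recovery_days and
    recovery_days + window_days before the failure, 0 if it is earlier
    or the drive never failed, and None if it is too close to the
    failure (or after it) and should be dropped from training.
    """
    if failure_day is None:
        return 0
    lead = (failure_day - sample_day).days
    if lead < recovery_days:
        return None
    return 1 if lead < recovery_days + window_days else 0
```

With this layout, one classifier would be trained per key of the dictionary returned by `split_by_state`, using labels produced by `label_sample`. A prediction made from a positively labeled sample then comes at least `recovery_days` before the actual failure, which is one way to read the timeliness goal stated in the abstract.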

Original language: English
Pages (from-to): 2529-2541
Number of pages: 13
Journal: IEEE Transactions on Computers
Volume: 74
Issue number: 8
State: Published - 2025

Keywords

  • SSD failure analysis
  • SSD failure prediction
  • flash reliability characteristics
