TY - JOUR
T1 - Prophet
T2 - SSD Failure Analysis and Prediction Guided by Flash Reliability Characteristics in Data Centers
AU - Song, Yunpeng
AU - Liang, Yujiong
AU - Liu, Jialin
AU - Shi, Liang
N1 - Publisher Copyright:
© 1968-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Solid-state drives (SSDs) are widely deployed in various fields, especially in data centers, for their excellent cost-effectiveness. However, SSDs may fail due to imperfect manufacturing processes, resulting in system-level failures and even downtime in data centers. This makes SSD failure prediction critical. Current studies focus on handling missing data, numerical normalization, and other statistical issues that arise when applying machine learning methods, but they do not consider the reliability characteristics of the underlying flash media of SSDs or the timeliness (the time between a predicted failure and the real failure) of SSD failure prediction results. In this work, we study the failure characteristics of over 200,000 drives from industry data centers, using daily data collected over a 4-year period. We first investigate the relationship between SSD attribute values and failures. We then analyze the SSD failure characteristics from several aspects (causes, differences between failures, and timeliness of prediction results), relying on flash reliability characteristics. Based on these findings, we propose a novel SSD failure prediction method, Prophet, which contains two components. First, to cope with the differences between failures, a diff-state method is proposed for differential machine learning modeling of SSDs in different “States”. We define the “State” of an SSD as the range of values in which the SSD currently lies in terms of some key attributes. Using flash reliability characteristics, we distinguish between different failures before training the model to obtain accurate predictions of different failure behaviors. Second, a recovery period method is proposed to enhance the timeliness of SSD failure prediction results through the design of the sample selection method. Operations personnel can use the enhanced timeliness to handle failed SSDs, for example by replacing or repairing them. Evaluation results on a real dataset show that Prophet substantially improves predictive ability, achieving high recall and a low false-positive rate while providing sufficient response time for the processing of failed SSDs.
KW - SSD failure analysis
KW - SSD failure prediction
KW - flash reliability characteristics
UR - https://www.scopus.com/pages/publications/105004592219
U2 - 10.1109/TC.2025.3566871
DO - 10.1109/TC.2025.3566871
M3 - Article
AN - SCOPUS:105004592219
SN - 0018-9340
VL - 74
SP - 2529
EP - 2541
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 8
ER -