TY - JOUR
T1 - CELOF
T2 - Effective and fast memory efficient local outlier detection in high-dimensional data streams
AU - Chen, Liang
AU - Wang, Wei
AU - Yang, Yun
N1 - Publisher Copyright: © 2021 Elsevier B.V.
PY - 2021/4
Y1 - 2021/4
N2 - Outlier detection is an important and challenging problem in industrial automation, where data are often collected in large amounts but with little labeled information. To realize real-time outlier detection on data streams, many models have been proposed in the academic literature. However, most existing outlier detection algorithms still have two main limitations: (1) they need a large amount of memory to store data, and (2) they detect outliers poorly in high-dimensional data in application scenarios. In this paper, we propose a new algorithm, called CELOF, which can effectively overcome these two limitations. In CELOF, we first use information entropy to construct a new index weight calculation method, which can distinguish the influence of different indexes and improve the detection accuracy on multi-dimensional data. Next, we design a new reachable distance factor discrimination method to extract the original data information and then propose a new strategy for outlier detection, which can greatly reduce the amount of data stored. Finally, the experimental results show that the CELOF algorithm achieves an average improvement of 15% in accuracy compared to state-of-the-art algorithms, and that CELOF's running time is less than 1% of that of the original LOF. Additionally, our comprehensive experiments use different real data sets for simulation, and the results show that our algorithm can be widely used in different practical application scenarios without any prior information or knowledge of the data distribution.
AB - Outlier detection is an important and challenging problem in industrial automation, where data are often collected in large amounts but with little labeled information. To realize real-time outlier detection on data streams, many models have been proposed in the academic literature. However, most existing outlier detection algorithms still have two main limitations: (1) they need a large amount of memory to store data, and (2) they detect outliers poorly in high-dimensional data in application scenarios. In this paper, we propose a new algorithm, called CELOF, which can effectively overcome these two limitations. In CELOF, we first use information entropy to construct a new index weight calculation method, which can distinguish the influence of different indexes and improve the detection accuracy on multi-dimensional data. Next, we design a new reachable distance factor discrimination method to extract the original data information and then propose a new strategy for outlier detection, which can greatly reduce the amount of data stored. Finally, the experimental results show that the CELOF algorithm achieves an average improvement of 15% in accuracy compared to state-of-the-art algorithms, and that CELOF's running time is less than 1% of that of the original LOF. Additionally, our comprehensive experiments use different real data sets for simulation, and the results show that our algorithm can be widely used in different practical application scenarios without any prior information or knowledge of the data distribution.
KW - Data extract
KW - Data stream
KW - Outlier detection
UR - https://www.scopus.com/pages/publications/85098994239
U2 - 10.1016/j.asoc.2021.107079
DO - 10.1016/j.asoc.2021.107079
M3 - Article
AN - SCOPUS:85098994239
SN - 1568-4946
VL - 102
JO - Applied Soft Computing
JF - Applied Soft Computing
M1 - 107079
ER -