TY - JOUR
T1 - EnAli
T2 - entity alignment across multiple heterogeneous data sources
AU - Kong, Chao
AU - Gao, Ming
AU - Xu, Chen
AU - Fu, Yunbin
AU - Qian, Weining
AU - Zhou, Aoying
N1 - Publisher Copyright:
© 2019, Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2019/2/1
Y1 - 2019/2/1
N2 - Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as those in other data sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval, and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but also needs to handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes via the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EnAli on re-identifying entities from the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
AB - Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as those in other data sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval, and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but also needs to handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes via the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EnAli on re-identifying entities from the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
KW - EM-algorithm
KW - entity alignment
KW - exponential family
KW - locality sensitive hashing
UR - https://www.scopus.com/pages/publications/85048307112
U2 - 10.1007/s11704-017-6561-3
DO - 10.1007/s11704-017-6561-3
M3 - Article
AN - SCOPUS:85048307112
SN - 2095-2228
VL - 13
SP - 157
EP - 169
JO - Frontiers of Computer Science
JF - Frontiers of Computer Science
IS - 1
ER -