
A pattern-based entity resolution algorithm

Research output: Contribution to journal › Article › peer-review

Abstract

As a critical step in data integration and data cleaning, entity resolution (ER) aims to identify groups of records that refer to the same real-world entity. Two approaches currently dominate. The first is exhaustive entity resolution, which compares all record pairs to determine which entity each record belongs to. However, its complexity, O(n^2) where n is the size of the dataset, is too high for large datasets. The second is blocking-based entity resolution, which maps similar records to the same block by a specific method (e.g., a hash function or a sliding window), so that only records in the same block need to be compared. This method improves efficiency but sacrifices effectiveness, because some records that refer to the same entity may not fall in the same block. In this paper we propose pattern-based entity resolution, which represents similar records by a record pattern and then generates a bound by comparing record patterns. With this bound, we can decide whether the records corresponding to two patterns need to be compared precisely to verify whether they refer to the same entity. In this way, we can both dramatically accelerate entity resolution by filtering out dissimilar records and ensure its correctness. Experiments on real and synthetic datasets show the efficiency and effectiveness of our method.
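
The filter-then-verify idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the record pattern or the bound, so everything concrete here is an assumption made for illustration: records are compared by Jaccard similarity over whitespace tokens, records are grouped by their lexicographically smallest token (a hypothetical grouping key), and each group's pattern is its token vocabulary plus its smallest record size. For any record a in group A and b in group B, the Jaccard similarity cannot exceed |vocab(A) ∩ vocab(B)| / max(min_size(A), min_size(B)), so group pairs whose bound falls below the threshold can be skipped without missing a match.

```python
from collections import defaultdict
from itertools import combinations, product


def tokens(record):
    """Token set of a record (assumed representation: lowercase words)."""
    return frozenset(record.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_patterns(records):
    """Group records and summarise each group as (vocab, min_size, members).

    The grouping key (smallest token) is a hypothetical choice; the bound
    below stays valid for any grouping.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        toks = tokens(rec)
        groups[min(toks) if toks else ""].append((idx, toks))
    patterns = []
    for members in groups.values():
        vocab = frozenset().union(*(t for _, t in members))
        min_size = min(len(t) for _, t in members)
        patterns.append((vocab, min_size, members))
    return patterns


def upper_bound(p, q):
    """Upper bound on the Jaccard similarity of any record pair across p and q."""
    denom = max(p[1], q[1])
    if denom == 0:
        return 1.0
    return min(1.0, len(p[0] & q[0]) / denom)


def resolve(records, threshold=0.5):
    """Return (matching index pairs, number of exact comparisons performed)."""
    patterns = build_patterns(records)
    matches, compared = set(), 0
    for i, p in enumerate(patterns):
        for q in patterns[i:]:
            if p is q:
                pairs = combinations(p[2], 2)   # pairs inside one group
            elif upper_bound(p, q) < threshold:
                continue                        # safely filtered by the bound
            else:
                pairs = product(p[2], q[2])     # pairs across two groups
            for (ia, ta), (ib, tb) in pairs:
                compared += 1
                if jaccard(ta, tb) >= threshold:
                    matches.add((min(ia, ib), max(ia, ib)))
    return matches, compared
```

Calling `resolve(records, threshold)` on a list of strings returns the matching index pairs together with the number of exact comparisons, which can be checked against the n(n-1)/2 comparisons of the exhaustive approach. Because the bound never underestimates the true similarity, the sketch returns the same matches as exhaustive comparison under these assumptions.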

Original language: English
Pages (from-to): 1796-1808
Number of pages: 13
Journal: Jisuanji Xuebao/Chinese Journal of Computers
Volume: 38
Issue number: 9
Publication status: Published - 1 Sep 2015
