Abstract
As a critical step in data integration and data cleaning, entity resolution (ER) aims to identify groups of records that refer to the same real-world entity. Two approaches currently dominate. The first is exhaustive entity resolution, which compares all record pairs to determine which entity each record belongs to. However, its complexity, O(n²) where n is the size of the dataset, is too high for large datasets. The second is blocking-based entity resolution, which maps similar records to the same block by a specific method (e.g., a hash function or a sliding window), so that only records in the same block need to be compared. This method improves efficiency but sacrifices effectiveness, since records that refer to the same entity may not fall in the same block. In this paper we propose pattern-based entity resolution, which represents similar records by a record pattern and derives a bound by comparing record patterns. With this bound, we can decide whether the records corresponding to two patterns need to be compared precisely to verify whether they refer to the same entity. In this way, we can both dramatically accelerate entity resolution by filtering out dissimilar records and ensure its correctness. Experiments on real and synthetic datasets show the efficiency and effectiveness of our method.
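The filter-then-verify idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define what a record pattern is or how the bound is computed, so everything below is an assumption made for illustration: string length stands in for the pattern, edit distance stands in for the precise comparison, and the bound is the well-known fact that the edit distance between two strings is at least the difference of their lengths. The function names `exhaustive_pairs` and `pattern_filtered_pairs` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exhaustive_pairs(records, threshold):
    """Compare every pair of records: n * (n - 1) / 2 precise comparisons."""
    matches = set()
    compared = 0
    for i, j in combinations(range(len(records)), 2):
        compared += 1
        if edit_distance(records[i], records[j]) <= threshold:
            matches.add((i, j))
    return matches, compared


def pattern_filtered_pairs(records, threshold):
    """Group records by a stand-in pattern (string length, an assumption),
    then skip every pair of patterns whose lower bound exceeds the threshold."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[len(rec)].append(idx)

    matches = set()
    compared = 0
    patterns = sorted(groups)
    for pi, p in enumerate(patterns):
        for q in patterns[pi:]:
            # The length difference is a lower bound on edit distance, so
            # no pair drawn from these two groups can be within threshold.
            if q - p > threshold:
                break  # patterns are sorted, so later ones are even farther
            if p == q:
                candidates = combinations(groups[p], 2)
            else:
                candidates = ((i, j) for i in groups[p] for j in groups[q])
            for i, j in candidates:
                compared += 1
                if edit_distance(records[i], records[j]) <= threshold:
                    matches.add((min(i, j), max(i, j)))
    return matches, compared
```

Because the bound never exceeds the true distance, the filtered version returns the same matches as the exhaustive one while running the precise comparison on fewer pairs. A blocking scheme gives no such guarantee, which is the trade-off the abstract points out.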
| Original language | English |
|---|---|
| Pages (from-to) | 1796-1808 |
| Number of pages | 13 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 38 |
| Issue | 9 |
| DOI | |
| Publication status | Published - 1 Sep 2015 |