TY - JOUR
T1 - MapReduce-based entity matching with multiple blocking functions
AU - Jin, Cheqing
AU - Chen, Jie
AU - Liu, Huiping
N1 - Publisher Copyright: © 2016, Higher Education Press and Springer-Verlag Berlin Heidelberg.
PY - 2017/10/1
Y1 - 2017/10/1
N2 - Entity matching that aims at finding some records belonging to the same real-world objects has been studied for decades. In order to avoid verifying every pair of records in a massive data set, a common method, known as the blocking-based method, tends to select a small proportion of record pairs for verification with a far lower cost than O(n^2), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical since much more matching records can be found in this way, so that the quality of the query result can be improved significantly. It is popular to use the MapReduce (MR) framework to improve the performance and the scalability of some complicated queries by running a lot of map (/reduce) tasks in parallel. However, entity matching upon the MapReduce framework is non-trivial due to two inevitable challenges: load balancing and pair deduplication. In this paper, we propose a novel solution, called MrEm, to handle these challenges with the support of multiple blocking functions. Although the existing work can deal with load balancing and pair deduplication respectively, it still cannot deal with both challenges at the same time. Theoretical analysis and experimental results upon real and synthetic data sets illustrate the high effectiveness and efficiency of our proposed solutions.
AB - Entity matching that aims at finding some records belonging to the same real-world objects has been studied for decades. In order to avoid verifying every pair of records in a massive data set, a common method, known as the blocking-based method, tends to select a small proportion of record pairs for verification with a far lower cost than O(n^2), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical since much more matching records can be found in this way, so that the quality of the query result can be improved significantly. It is popular to use the MapReduce (MR) framework to improve the performance and the scalability of some complicated queries by running a lot of map (/reduce) tasks in parallel. However, entity matching upon the MapReduce framework is non-trivial due to two inevitable challenges: load balancing and pair deduplication. In this paper, we propose a novel solution, called MrEm, to handle these challenges with the support of multiple blocking functions. Although the existing work can deal with load balancing and pair deduplication respectively, it still cannot deal with both challenges at the same time. Theoretical analysis and experimental results upon real and synthetic data sets illustrate the high effectiveness and efficiency of our proposed solutions.
KW - MapReduce
KW - entity matching
KW - load balancing
KW - pair deduplication
UR - https://www.scopus.com/pages/publications/84981277923
U2 - 10.1007/s11704-016-5346-4
DO - 10.1007/s11704-016-5346-4
M3 - Article
AN - SCOPUS:84981277923
SN - 2095-2228
VL - 11
SP - 895
EP - 911
JO - Frontiers of Computer Science
JF - Frontiers of Computer Science
IS - 5
ER -