MapReduce-based entity matching with multiple blocking functions

Cheqing Jin, Jie Chen, Huiping Liu

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Entity matching, which aims to find records that refer to the same real-world object, has been studied for decades. To avoid verifying every pair of records in a massive data set, a common approach, known as the blocking-based method, selects a small proportion of record pairs for verification at a cost far lower than O(n^2), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical, since many more matching records can be found in this way and the quality of the query result improves significantly. The MapReduce (MR) framework is widely used to improve the performance and scalability of complicated queries by running many map and reduce tasks in parallel. However, entity matching on the MapReduce framework is non-trivial because of two unavoidable challenges: load balancing and pair deduplication. In this paper, we propose a novel solution, called MrEm, that handles these challenges while supporting multiple blocking functions. Existing work can deal with load balancing and pair deduplication separately, but it cannot deal with both challenges at the same time. Theoretical analysis and experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed solution.
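To make the pair-deduplication problem concrete, the following is a minimal single-process sketch of blocking with two blocking functions, written in a map/reduce style. It is not the MrEm algorithm and does not address load balancing; the records, the two blocking functions (`by_name_prefix`, `by_city`) and all other names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record id, name, city).
records = [
    (1, "Jon Smith", "Boston"),
    (2, "John Smith", "Boston"),
    (3, "Jon Smyth", "Boston"),
    (4, "Mary Jones", "Austin"),
]

# Two hypothetical blocking functions; each maps a record to a block key.
def by_name_prefix(rec):
    return rec[1][:3].lower()

def by_city(rec):
    return rec[2].lower()

blocking_functions = [by_name_prefix, by_city]

def map_phase(recs):
    """Emit ((function index, block key), record id) for every record and function."""
    for rec in recs:
        for i, fn in enumerate(blocking_functions):
            yield (i, fn(rec)), rec[0]

def reduce_phase(keyed_ids):
    """Group record ids by block, then emit every pair inside each block."""
    blocks = defaultdict(list)
    for key, rid in keyed_ids:
        blocks[key].append(rid)
    for rids in blocks.values():
        for a, b in combinations(sorted(rids), 2):
            yield (a, b)

# Pairs as produced by the blocks; the same pair can appear more than once
# when two records share a block under several blocking functions.
raw_pairs = list(reduce_phase(map_phase(records)))

# Naive deduplication: collect everything in one set. In a distributed
# setting this step is what becomes costly and needs a smarter scheme.
candidate_pairs = sorted(set(raw_pairs))

print(len(raw_pairs), candidate_pairs)
```

In this toy input, records 1 and 3 share both a name prefix and a city, so the pair (1, 3) is emitted by two different blocks. The sketch removes the duplicate with a global set, which works in one process but would require an extra shuffle or coordination step across reducers in a real MapReduce job.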

Original language: English
Pages (from-to): 895-911
Number of pages: 17
Journal: Frontiers of Computer Science
Volume: 11
Issue number: 5
State: Published - 1 Oct 2017

Keywords

  • MapReduce
  • entity matching
  • load balancing
  • pair deduplication
