Random-based algorithm for efficient entity matching

  • Pingfu Chao
  • Zhu Gao
  • Yuming Li
  • Junhua Fang
  • Rong Zhang*
  • Aoying Zhou

  *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Most state-of-the-art MapReduce-based entity matching methods inherit traditional entity resolution techniques from centralized systems and focus on data blocking strategies for structured entities in order to solve the load balancing problem that occurs in distributed environments. In this paper, we propose a MapReduce-based framework for entity matching on semi-structured and unstructured data. Each entity is represented by a high-dimensional vector generated from its description data. To reduce network transmission, we produce lower-dimensional bit vectors, called signatures, for those entity vectors using a Locality Sensitive Hash (LSH) function. The LSH function is required to preserve cosine similarity. A series of randomized algorithms is designed to ensure entity matching performance. Moreover, our design includes a solution that reduces redundant computation with one additional round of MapReduce. Experiments show that our approach has a significant advantage in both processing speed and accuracy over the other methods.
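
The abstract does not specify which LSH family is used. As an illustration only, the sketch below uses the standard random-hyperplane scheme for cosine similarity: each signature bit records on which side of a random hyperplane the entity vector falls, and the fraction of differing bits between two signatures approximates the angle between the vectors divided by π. The function names, the seed, and the signature length are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming(sig_a, sig_b):
    """Number of positions at which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from two signatures.

    For random hyperplanes, P[bits differ] = theta / pi, so the angle is
    approximated by pi * hamming / num_bits.
    """
    theta = math.pi * hamming(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)


if __name__ == "__main__":
    # Hypothetical 5-dimensional entity vectors; real ones would be far larger.
    planes = make_hyperplanes(dim=5, num_bits=256, seed=42)
    a = [1.0, 0.5, 0.0, 2.0, 1.5]
    b = [0.9, 0.6, 0.1, 1.8, 1.4]
    print(estimated_cosine(signature(a, planes), signature(b, planes)))
```

Only the bit signatures need to be shuffled between nodes, which is the sense in which such a scheme could reduce network transmission. Longer signatures give a tighter estimate at the cost of more bits per entity.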

Original language: English
Title of host publication: Web Technologies and Applications - 17th Asia-Pacific Web Conference, APWeb 2015, Proceedings
Editors: Reynold Cheng, Bin Cui, Zhenjie Zhang, Ruichu Cai, Jia Xu
Publisher: Springer Verlag
Pages: 509-521
Number of pages: 13
ISBN (Print): 9783319252544
State: Published - 2015
Event: 17th Asia-Pacific Web Conference, APWeb 2015 - Guangzhou, China
Duration: 18 Sep 2015 to 20 Sep 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9313
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th Asia-Pacific Web Conference, APWeb 2015
Country/Territory: China
City: Guangzhou
Period: 18/09/15 to 20/09/15
