Entity matching across multiple heterogeneous data sources

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

27 Scopus citations

Abstract

Entity matching is the problem of identifying which entities in one data source refer to the same real-world entity as entities in the others. Identifying entities across heterogeneous data sources is essential to applications such as entity profiling and product recommendation. The matching process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but also needs to handle heterogeneous entity attributes. In this paper, we design an unsupervised approach, called EMAN, to match entities across two or more heterogeneous data sources. The algorithm uses a locality sensitive hashing scheme to reduce the number of candidate tuples and speed up the matching process. To handle the heterogeneous entity attributes, we employ the exponential family to model the similarities between the different attributes. EMAN is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EMAN on re-identifying entities from the same data source, as well as on matching entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
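The abstract does not say which hash family or parameters EMAN uses, so the following is only a minimal sketch of the general idea of using locality sensitive hashing to cut down candidate tuples. It assumes MinHash over character 3-grams with banding; the function names (`shingles`, `minhash_signature`, `candidate_pairs`) and the parameter values are illustrative assumptions, not the authors' implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(records, num_hashes=32, bands=16):
    """Group records by signature band; only records that share a bucket
    are compared later, instead of all pairs of tuples.

    `records` maps a record id to a text value (an assumed, simplified
    stand-in for a tuple's attributes).
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rid)

    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical records from two sources, keyed by made-up ids.
    records = {
        "a1": "Apple iPhone 6",
        "b1": "apple iphone 6",
        "c1": "Totally different thing",
    }
    print(candidate_pairs(records))
```

Only the surviving candidate pairs would then be scored by a similarity model; fewer rows per band makes the filter more permissive, more rows makes it stricter.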

Original language: English
Title of host publication: Database Systems for Advanced Applications - 21st International Conference, DASFAA 2016, Proceedings
Editors: Shamkant B. Navathe, Weili Wu, Shashi Shekhar, Xiaoyong Du, Hui Xiong, X. Sean Wang
Publisher: Springer Verlag
Pages: 133-146
Number of pages: 14
ISBN (Print): 9783319320243
State: Published - 2016
Event: 21st International Conference on Database Systems for Advanced Applications, DASFAA 2016 - Dallas, United States
Duration: 16 Apr 2016 → 19 Apr 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9642
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Database Systems for Advanced Applications, DASFAA 2016
Country/Territory: United States
City: Dallas
Period: 16/04/16 → 19/04/16

Keywords

  • Entity matching
  • Exponential family
  • Locality sensitive hashing
