Similarity query processing for probabilistic sets

Ming Gao, Cheqing Jin, Wei Wang, Xuemin Lin, Aoying Zhou

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

6 Scopus citations

Abstract

Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain due to various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible world semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.

Original languageEnglish
Title of host publicationICDE 2013 - 29th International Conference on Data Engineering
Pages913-924
Number of pages12
DOIs
StatePublished - 2013
Event29th International Conference on Data Engineering, ICDE 2013 - Brisbane, QLD, Australia
Duration: 8 Apr 201311 Apr 2013

Publication series

NameProceedings - International Conference on Data Engineering
ISSN (Print)1084-4627

Conference

Conference29th International Conference on Data Engineering, ICDE 2013
Country/TerritoryAustralia
CityBrisbane, QLD
Period8/04/1311/04/13

Fingerprint

Dive into the research topics of 'Similarity query processing for probabilistic sets'. Together they form a unique fingerprint.

Cite this