TY - JOUR
T1 - Top-k Spatio-Textual Similarity Join
AU - Hu, Huiqi
AU - Li, Guoliang
AU - Bao, Zhifeng
AU - Feng, Jianhua
AU - Wu, Yongwei
AU - Gong, Zhiguo
AU - Xu, Yaoqiang
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2016/2/1
Y1 - 2016/2/1
N2 - With the development of location-based services (LBS), LBS users are generating more and more spatio-textual data, e.g., checkins and attraction reviews. Since a spatio-textual entity may have different representations, possibly due to GPS deviations or typographical errors, it calls for effective methods to integrate the spatio-textual data from different data sources. In this paper, we study the problem of top-k spatio-textual similarity join (Topk-STJoin), which identifies the k most similar pairs from two spatio-textual data sets. One big challenge in Topk-STJoin is to efficiently identify the top- k similar pairs by considering both textual relevancy and spatial proximity. Traditional join algorithms that consider only one dimension (textual or spatial) are inefficient because they cannot utilize the pruning ability on the other dimension. To address this challenge, we propose a signature-based top-k join framework. We first generate a spatio-textual signature set for each object such that if two objects are in the top- k similar pairs, their signature sets must overlap. With this property, we can prune large numbers of dissimilar pairs without common signatures. We find that the order of accessing the signatures has a significant effect on the performance. So, we compute an upper bound for each signature and propose a best-first accessing method that preferentially accesses signatures with large upper bounds while those pairs with small upper bounds can be pruned. We prove the optimality of our best-first accessing method. Next, we optimize the spatio-textual signatures and propose progressive signatures to further improve the pruning power. Experimental results on real-world datasets show that our algorithm achieves high performance and good scalability, and significantly outperforms baseline approaches.
AB - With the development of location-based services (LBS), LBS users are generating more and more spatio-textual data, e.g., checkins and attraction reviews. Since a spatio-textual entity may have different representations, possibly due to GPS deviations or typographical errors, it calls for effective methods to integrate the spatio-textual data from different data sources. In this paper, we study the problem of top-k spatio-textual similarity join (Topk-STJoin), which identifies the k most similar pairs from two spatio-textual data sets. One big challenge in Topk-STJoin is to efficiently identify the top- k similar pairs by considering both textual relevancy and spatial proximity. Traditional join algorithms that consider only one dimension (textual or spatial) are inefficient because they cannot utilize the pruning ability on the other dimension. To address this challenge, we propose a signature-based top-k join framework. We first generate a spatio-textual signature set for each object such that if two objects are in the top- k similar pairs, their signature sets must overlap. With this property, we can prune large numbers of dissimilar pairs without common signatures. We find that the order of accessing the signatures has a significant effect on the performance. So, we compute an upper bound for each signature and propose a best-first accessing method that preferentially accesses signatures with large upper bounds while those pairs with small upper bounds can be pruned. We prove the optimality of our best-first accessing method. Next, we optimize the spatio-textual signatures and propose progressive signatures to further improve the pruning power. Experimental results on real-world datasets show that our algorithm achieves high performance and good scalability, and significantly outperforms baseline approaches.
UR - https://www.scopus.com/pages/publications/84962468997
U2 - 10.1109/TKDE.2015.2485213
DO - 10.1109/TKDE.2015.2485213
M3 - 文章
AN - SCOPUS:84962468997
SN - 1041-4347
VL - 28
SP - 551
EP - 565
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 2
M1 - 7286817
ER -