TY - JOUR
T1 - GFilter
T2 - A General Gram Filter for String Similarity Search
AU - Hu, Haoji
AU - Zheng, Kai
AU - Wang, Xiaoling
AU - Zhou, Aoying
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2015/4/1
Y1 - 2015/4/1
N2 - Numerous applications such as data integration, protein detection, and article copy detection share a similar core problem: given a string as the query, how to efficiently find all the similar answers from a large scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works aim to use advanced filters to improve the overall search performance. In this paper, we propose a gram-based framework to achieve near maximum filter performance. The main idea is to judiciously choose the high-quality grams as the prefix of query according to their estimated ability to filter candidates. As this selection process is proved to be NP-hard problem, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework in comparison with the state-of-art approaches.
AB - Numerous applications such as data integration, protein detection, and article copy detection share a similar core problem: given a string as the query, how to efficiently find all the similar answers from a large scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works aim to use advanced filters to improve the overall search performance. In this paper, we propose a gram-based framework to achieve near maximum filter performance. The main idea is to judiciously choose the high-quality grams as the prefix of query according to their estimated ability to filter candidates. As this selection process is proved to be NP-hard problem, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework in comparison with the state-of-art approaches.
KW - Data integration, similarity search
KW - gram-based framework
UR - https://www.scopus.com/pages/publications/84924872273
U2 - 10.1109/TKDE.2014.2349914
DO - 10.1109/TKDE.2014.2349914
M3 - 文章
AN - SCOPUS:84924872273
SN - 1041-4347
VL - 27
SP - 1005
EP - 1018
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 4
M1 - 6880793
ER -