Abstract
Numerous applications such as data integration, protein detection, and article copy detection share a similar core problem: given a string as the query, how to efficiently find all the similar answers from a large scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works aim to use advanced filters to improve the overall search performance. In this paper, we propose a gram-based framework to achieve near maximum filter performance. The main idea is to judiciously choose the high-quality grams as the prefix of query according to their estimated ability to filter candidates. As this selection process is proved to be NP-hard problem, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework in comparison with the state-of-art approaches.
| Original language | English |
|---|---|
| Article number | 6880793 |
| Pages (from-to) | 1005-1018 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Knowledge and Data Engineering |
| Volume | 27 |
| Issue number | 4 |
| DOIs | |
| State | Published - 1 Apr 2015 |
Keywords
- Data integration, similarity search
- gram-based framework
Fingerprint
Dive into the research topics of 'GFilter: A General Gram Filter for String Similarity Search'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver