Abstract
Eliminating duplicates in large databases has become a prominent issue in data quality research. This paper studies the problem and proposes an efficient N-Gram based approach for detecting approximately duplicate database records. The contributions are: (1) an efficient N-Gram based clustering algorithm, together with an improved variant of it; (2) a very efficient, application-independent pair-wise comparison algorithm based on edit distance; and (3) an improved algorithm for detecting approximately duplicate records that employs a priority queue. An experimental environment is set up, a large number of experiments are run on the algorithms, and a detailed analysis of the results is presented. The experiments and analysis validate the efficiency and soundness of the N-Gram based approach to detecting approximately duplicate records.
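
The abstract names two building blocks, N-Gram based comparison and edit-distance based pair-wise comparison, without giving their details. The following is only a minimal illustrative sketch of those two general ideas, not the paper's algorithms: the function names, the choice of character trigrams, and the use of Jaccard overlap as the N-Gram similarity are all assumptions made here for illustration.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string (lowercased).

    Hypothetical helper: n=3 is an assumed default, not taken from the paper.
    """
    s = text.lower()
    if len(s) < n:
        # Strings shorter than n contribute themselves as a single gram.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets, in [0, 1] (assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def edit_distance(a, b):
    """Standard Levenshtein distance, computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```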
| Original language | English |
|---|---|
| Pages (from-to) | 69-77 |
| Number of pages | 9 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 24 |
| Issue number | 1 |
| State | Published - Jan 2001 |
| Externally published | Yes |
Keywords
- Approximately duplicated records
- Clustering
- Information integration
- N-Gram
- Pair-wise
- Priority queue