Abstract
Eliminating duplicates in large databases has become a prominent issue in data quality research. This paper studies the problem and proposes an efficient N-Gram based approach for detecting approximately duplicate database records. The contributions are: (1) an efficient N-Gram based clustering algorithm, together with an improved version of it; (2) a highly efficient, application-independent pair-wise comparison algorithm based on edit distance; (3) an improved algorithm that uses a priority queue to detect approximately duplicate records. In addition, an experimental environment is set up, extensive tests of the algorithms are carried out, and a detailed analysis of the results is presented. The experiments and analysis validate the efficiency and soundness of the N-Gram based approach to approximately duplicate record detection.
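To make contribution (2) more concrete, the following is a minimal sketch of a pair-wise comparison based on edit distance. It is an illustration only, not the paper's algorithm: the function names `edit_distance` and `is_approx_duplicate`, the normalization by the longer string, and the 0.2 threshold are all assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_approx_duplicate(r1: str, r2: str, threshold: float = 0.2) -> bool:
    """Hypothetical rule: two records match when their edit distance,
    normalized by the longer record's length, is at most `threshold`."""
    longest = max(len(r1), len(r2))
    if longest == 0:
        return True  # two empty records are trivially identical
    return edit_distance(r1, r2) / longest <= threshold
```

Because the comparison works on raw strings, it does not depend on any field semantics, which is one way to read "application independent". In practice the threshold would have to be tuned on the data at hand.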
| Original language | English |
|---|---|
| Pages (from-to) | 69-77 |
| Number of pages | 9 |
| Journal | Jisuanji Xuebao/Chinese Journal of Computers |
| Volume | 24 |
| Issue | 1 |
| Publication status | Published - Jan 2001 |
| Externally published | Yes |