
Efficient approach for detecting approximately duplicate database records

  • Y. F. Qiu*
  • Z. P. Tian
  • W. Y. Ji
  • A. Y. Zhou
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Eliminating duplicates in large databases has become a prominent issue in data quality research. This paper studies the problem and proposes an efficient N-Gram based approach for detecting approximately duplicate database records. Its contributions are: (1) an efficient N-Gram based clustering algorithm, together with an improved version of that algorithm; (2) a very efficient, application-independent pair-wise comparison algorithm based on edit distance; and (3) an improved algorithm for detecting approximately duplicate records that employs a priority queue. In addition, an experimental environment is set up, many algorithm tests are carried out across a range of different experiments, and a detailed analysis of the results is presented. Taken together, the experiments and analysis validate the efficiency and soundness of the N-Gram based approach to approximately duplicate record detection.
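The abstract names two building blocks, N-Gram comparison and edit-distance-based pair-wise comparison, without giving their details. The following is a minimal Python sketch of how such pieces might fit together, not the authors' algorithm: the function names, the Dice-coefficient filter, the thresholds `min_sim` and `max_dist`, and the brute-force pair loop are all assumptions made for illustration. The paper's clustering step and priority-queue strategy are not reproduced here.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the multiset of character n-grams of a normalized string."""
    s = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_similarity(a, b, n=2):
    """Dice coefficient over n-gram multisets (an assumed cheap filter)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0  # both strings too short to yield any n-gram
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_duplicates(records, n=2, min_sim=0.5, max_dist=3):
    """Flag record pairs that pass both the n-gram filter and the
    edit-distance check. The thresholds are hypothetical, and the
    quadratic pair loop is for illustration only."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            if (ngram_similarity(a, b, n) >= min_sim
                    and edit_distance(a.lower(), b.lower()) <= max_dist):
                pairs.append((i, j))
    return pairs
```

For example, `edit_distance("kitten", "sitting")` returns 3. Running the cheap N-Gram filter before the more expensive edit-distance check is one plausible way to limit pair-wise comparisons; an approach meant for large databases would also need to avoid comparing every pair of records.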

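Original language: English
Pages (from-to): 69-77
Number of pages: 9
Journal: Jisuanji Xuebao/Chinese Journal of Computers
Volume: 24
Issue number: 1
Publication status: Published - Jan 2001
Externally published: Yes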
