Efficient approach for detecting approximately duplicate database records

  • Y. F. Qiu*
  • Z. P. Tian
  • W. Y. Ji
  • A. Y. Zhou
  • *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Eliminating duplicates in large databases has become a hot issue in data quality research. This paper studies the problem and proposes an efficient N-Gram based approach for detecting approximately duplicate database records. The contributions of this paper are: (1) An efficient N-Gram based clustering algorithm is proposed, together with an improved version of it. (2) A very efficient, application-independent Pair-Wise comparison algorithm based on the edit distance is used. (3) For detecting approximately duplicate records, an improved algorithm that employs a priority queue is presented. Furthermore, an experimental environment is set up and extensive algorithm tests are carried out, producing a large body of results from many different experiments. A detailed analysis of these results is also presented. Based on the experiments and analysis in this paper, the efficiency, rationality, and scientific soundness of the N-Gram based approach to detecting approximately duplicate records are validated.
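The abstract names two building blocks, N-Gram based grouping and edit-distance pair-wise comparison, without giving their details. The following is a minimal illustrative sketch of how such pieces are commonly combined, not the authors' algorithm: the function names, the use of character trigrams, the Dice coefficient, and both thresholds are assumptions chosen for the example.

```python
from collections import Counter


def ngrams(text: str, n: int = 3) -> Counter:
    """Return the multiset of character n-grams of a normalised string."""
    s = " ".join(text.lower().split())  # lowercase, collapse whitespace
    if len(s) < n:
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over n-gram multisets, in [0, 1] (assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string to keep rows small
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_near_duplicate(a: str, b: str,
                      ngram_threshold: float = 0.5,
                      max_relative_edits: float = 0.2) -> bool:
    """Cheap n-gram filter first, then the costlier edit-distance check.

    Both thresholds are hypothetical values for illustration only.
    """
    if ngram_similarity(a, b) < ngram_threshold:
        return False
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= max_relative_edits
```

For example, `is_near_duplicate("John Smith, 12 Main St", "Jon Smith, 12 Main St")` passes both checks, while two unrelated names are rejected by the n-gram filter before any edit distance is computed. Running the cheap filter first is the usual reason for pairing the two measures.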

Original language: English
Pages (from-to): 69-77
Number of pages: 9
Journal: Jisuanji Xuebao/Chinese Journal of Computers
Volume: 24
Issue number: 1
State: Published - Jan 2001
Externally published: Yes

Keywords

  • Approximately duplicated records
  • Clustering
  • Information integration
  • N-Gram
  • Pair-wise
  • Priority queue
