TY - JOUR
T1 - An n-gram-based approach for detecting approximately duplicate database records
AU - Tian, Zengping
AU - Lu, Hongjun
AU - Ji, Wenyun
AU - Zhou, Aoying
AU - Tian, Zhong
PY - 2000
Y1 - 2000
N2 - Detecting and eliminating duplicate records is one of the major tasks for improving data quality. The task, however, is not as trivial as it seems since various errors, such as character insertion, deletion, transposition, substitution, and word switching, are often present in real-world databases. This paper presents an n-gram-based approach for detecting duplicate records in large databases. Using the approach, records are first mapped to numbers based on the n-grams of their field values. The obtained numbers are then clustered, and records within a cluster are taken as potential duplicate records. Finally, record comparisons are performed within clusters to identify true duplicate records. The unique feature of this method is that it does not require preprocessing to correct syntactic or typographical errors in the source data in order to achieve high accuracy. Moreover, sorting the source data file is unnecessary. Only a fixed number of database scans is required. Therefore, compared with previous methods, the algorithm is more time efficient.
AB - Detecting and eliminating duplicate records is one of the major tasks for improving data quality. The task, however, is not as trivial as it seems since various errors, such as character insertion, deletion, transposition, substitution, and word switching, are often present in real-world databases. This paper presents an n-gram-based approach for detecting duplicate records in large databases. Using the approach, records are first mapped to numbers based on the n-grams of their field values. The obtained numbers are then clustered, and records within a cluster are taken as potential duplicate records. Finally, record comparisons are performed within clusters to identify true duplicate records. The unique feature of this method is that it does not require preprocessing to correct syntactic or typographical errors in the source data in order to achieve high accuracy. Moreover, sorting the source data file is unnecessary. Only a fixed number of database scans is required. Therefore, compared with previous methods, the algorithm is more time efficient.
KW - Data quality
KW - Duplicate elimination
KW - Edit distance
KW - N-gram
UR - https://www.scopus.com/pages/publications/77956142012
U2 - 10.1007/s007990100044
DO - 10.1007/s007990100044
M3 - Article
AN - SCOPUS:77956142012
SN - 1432-5012
VL - 3
SP - 325
EP - 331
JO - International Journal on Digital Libraries
JF - International Journal on Digital Libraries
IS - 4
ER -