TY - GEN
T1 - An effective and efficient data cleaning technique in large databases
AU - Zhang, Ji
AU - Liu, Han
PY - 2004
Y1 - 2004
N2 - In this paper, we propose PC-Cleaner (PC stands for Partition Comparison), a novel technique for effective and efficient duplicate record detection in large databases. PC-Cleaner distinguishes itself from all existing methods by using the notion of partitions in duplicate detection. It first sorts the whole database and splits the sorted database into a number of record partitions. The Partition Comparison Graph (PCG) is then generated by performing fast partition pruning. Finally, duplicate records are detected by using internal and external partition comparison based on the PCG. Four properties, used as heuristics and based on the triangle inequality of record similarity, have been devised to achieve a remarkable efficiency improvement of the cleaner. PC-Cleaner is insensitive to the key used to sort the database and can achieve a very good recall level that is comparable to that of the pair-wise record comparison method. PC-Cleaner is able to solve the "Key Selection" problem and the "Low Recall" problem from which existing methods suffer.
AB - In this paper, we propose PC-Cleaner (PC stands for Partition Comparison), a novel technique for effective and efficient duplicate record detection in large databases. PC-Cleaner distinguishes itself from all existing methods by using the notion of partitions in duplicate detection. It first sorts the whole database and splits the sorted database into a number of record partitions. The Partition Comparison Graph (PCG) is then generated by performing fast partition pruning. Finally, duplicate records are detected by using internal and external partition comparison based on the PCG. Four properties, used as heuristics and based on the triangle inequality of record similarity, have been devised to achieve a remarkable efficiency improvement of the cleaner. PC-Cleaner is insensitive to the key used to sort the database and can achieve a very good recall level that is comparable to that of the pair-wise record comparison method. PC-Cleaner is able to solve the "Key Selection" problem and the "Low Recall" problem from which existing methods suffer.
UR - http://www.scopus.com/inward/record.url?scp=12344296840&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=12344296840&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:12344296840
SN - 1932415270
SN - 9781932415278
T3 - Proceedings of the International Conference on Information and Knowledge Engineering, IKE'04
SP - 501
EP - 504
BT - Proceedings of the International Conference on Information and Knowledge Engineering, IKE'04
A2 - Arabnia, H.R.
T2 - Proceedings of the International Conference on Information and Knowledge Engineering, IKE'04
Y2 - 21 June 2004 through 24 June 2004
ER -