PC-Filter: A robust filtering technique for duplicate record detection in large databases

Ji Zhang, Tok Wang Ling, Robert M. Bruckner, Han Liu

Research output: Chapter in Book/Report/Conference proceedingChapter

3 Scopus citations

Abstract

In this paper, we will propose PC-Filter (PC stands for Partition Comparison), a robust data filter for approximately duplicate record detection in large databases. PC-Filter distinguishes itself from all of existing methods by using the notion of partition in duplicate detection. It first sorts the whole database and splits the sorted database into a number of record partitions. The Partition Comparison Graph (PCG) is then constructed by performing fast partition pruning. Finally, duplicate records are effectively detected by using internal and external partition comparison based on PCG. Four properties, used as heuristics, have been devised to achieve a remarkable efficiency of the filter based on triangle inequity of record similarity. PC-Filter is insensitive to the key used to sort the database, and can achieve a very good recall level that is comparable to that of the pair-wise record comparison method but only with a complexity of O(N4/3). Equipping existing detection methods with PC-Filter, we are able to well solve the "Key Selection" problem, the "Scope Specification" problem and the "Low Recall" problem that the existing methods suffer from.

Original languageEnglish (US)
Title of host publicationLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
EditorsFernando Galindo, Makoto Takizawa, Roland Traunmuller
PublisherSpringer Verlag
Pages486-496
Number of pages11
ISBN (Print)3540229361, 9783540229360
DOIs
StatePublished - 2004

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume3180
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science

Fingerprint

Dive into the research topics of 'PC-Filter: A robust filtering technique for duplicate record detection in large databases'. Together they form a unique fingerprint.

Cite this