INDEPENDENT DE-DUPLICATION IN DATA CLEANING

Ajumobi Udechukwu; Christie Ezeife; Ken Barker

Publication date: 1 January 2005
Publisher: Faculty of Organization and Informatics, University of Zagreb

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication: domain-independence at the attribute level and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate de-duplication than the existing algorithms.
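
The distinction between the two levels of domain-independence can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not describe the positional algorithm, so the attribute-level comparison below uses Python's standard `difflib.SequenceMatcher` as a generic stand-in, and the profiling rule (weighting each field by its share of distinct values) is an assumption chosen for illustration. All function names and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Attribute level: compare two field values with no domain knowledge.

    Stand-in measure only; NOT the positional algorithm from the paper.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def profile_field_weights(records: list[dict[str, str]]) -> dict[str, float]:
    """Record level: derive field weights from the data itself.

    Assumed rule: a field with more distinct values discriminates better,
    so it receives a larger weight. Weights are normalized to sum to 1.
    """
    fields = list(records[0].keys())
    distinct_ratio = {
        f: len({r[f].lower().strip() for r in records}) / len(records)
        for f in fields
    }
    total = sum(distinct_ratio.values())
    return {f: ratio / total for f, ratio in distinct_ratio.items()}


def record_similarity(r1: dict[str, str], r2: dict[str, str],
                      weights: dict[str, float]) -> float:
    """Combine per-field similarities using the profiled weights."""
    return sum(w * attribute_similarity(r1[f], r2[f]) for f, w in weights.items())


# Hypothetical sample data: the first two records describe the same person.
records = [
    {"name": "John Smith", "city": "Windsor", "country": "Canada"},
    {"name": "Jon Smith", "city": "Windsor", "country": "Canada"},
    {"name": "Mary Jones", "city": "Calgary", "country": "Canada"},
]
weights = profile_field_weights(records)
likely_duplicate = record_similarity(records[0], records[1], weights)
likely_distinct = record_similarity(records[0], records[2], weights)
```

In this sketch the `country` field, which holds the same value in every record, receives the smallest weight, so agreement on it contributes little to the record score. No field is named or weighted by hand, which is the sense of record-level domain-independence described in the abstract.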
