
    Coreference detection of low quality objects

    The problem of record linkage is a widely studied problem that aims to identify coreferent (i.e. duplicate) data in a structured data source. As indicated by Winkler, a solution to the record linkage problem is only possible if the error rate is sufficiently low. In other words, in order to successfully deduplicate a database, the objects in the database must be of sufficient quality. However, this assumption does not always hold. In this paper, it is investigated how merging low quality objects into one high quality object can improve the process of record linkage. This general idea is illustrated in the context of string comparison, where strings of low quality (i.e. with a high typographical error rate) are merged into a string of high quality by using an n-dimensional Levenshtein distance matrix to compute the optimal alignment between the dirty strings. Results are presented and possible refinements are proposed.
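
    To make the merging step concrete, the sketch below (not the authors' implementation; all names and the cost model are illustrative assumptions) aligns a small set of dirty strings with an n-dimensional edit-distance dynamic program using a sum-of-pairs cost, and then merges them into one string by per-column majority vote.

from itertools import product
from collections import Counter

GAP = "-"  # assumes the gap symbol does not occur in the input strings

def multi_align(strings):
    """Align n strings over an n-dimensional edit-distance grid (sum-of-pairs cost)."""
    n = len(strings)
    lens = [len(s) for s in strings]
    origin = (0,) * n
    dp = {origin: (0, None, None)}  # grid point -> (cost, predecessor, aligned column)
    for point in product(*(range(l + 1) for l in lens)):
        if point == origin:
            continue
        best = None
        # each move advances a subset of the strings; the others receive a gap
        for move in product((0, 1), repeat=n):
            if not any(move):
                continue
            prev = tuple(p - m for p, m in zip(point, move))
            if any(x < 0 for x in prev):
                continue
            col = tuple(strings[k][prev[k]] if move[k] else GAP for k in range(n))
            # sum-of-pairs cost: one unit per mismatching pair in the column
            cost = sum(1 for a in range(n) for b in range(a + 1, n) if col[a] != col[b])
            total = dp[prev][0] + cost
            if best is None or total < best[0]:
                best = (total, prev, col)
        dp[point] = best
    # backtrack from the far corner to recover the aligned columns
    columns, point = [], tuple(lens)
    while dp[point][1] is not None:
        _, prev, col = dp[point]
        columns.append(col)
        point = prev
    return list(reversed(columns))

def merge_dirty_strings(strings):
    """Merge low quality strings into one by per-column majority vote."""
    merged = []
    for col in multi_align(strings):
        char, _ = Counter(col).most_common(1)[0]
        if char != GAP:
            merged.append(char)
    return "".join(merged)

if __name__ == "__main__":
    # three typo-ridden variants of the same string; expected to print
    # something close to "record linkage"
    print(merge_dirty_strings(["recrod linkage", "record linkagee", "reord linkage"]))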

    Indeterministic Handling of Uncertain Decisions in Duplicate Detection

    In current research, duplicate detection is usually considered a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity or not. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
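
    As a minimal sketch of the semi-indeterministic idea (illustrative names and thresholds, not the paper's algorithm), the code below decides clear cases deterministically and keeps only the genuinely uncertain pairs open, which then span a small set of possible worlds.

from dataclasses import dataclass
from itertools import product

@dataclass
class Decision:
    pair: tuple      # (record_id_a, record_id_b)
    p_match: float   # estimated probability that the pair is a duplicate

def split_decisions(decisions, lower=0.2, upper=0.8):
    """Semi-indeterministic heuristic: decide clear cases deterministically,
    keep only genuinely uncertain pairs open (thresholds are illustrative)."""
    certain, uncertain = [], []
    for d in decisions:
        if d.p_match >= upper or d.p_match <= lower:
            certain.append((d.pair, d.p_match >= upper))
        else:
            uncertain.append(d)
    return certain, uncertain

def possible_worlds(uncertain):
    """Enumerate the possible worlds spanned by the open decisions, each with
    its probability (decisions are assumed independent for simplicity)."""
    for choices in product((True, False), repeat=len(uncertain)):
        prob = 1.0
        world = {}
        for d, is_dup in zip(uncertain, choices):
            world[d.pair] = is_dup
            prob *= d.p_match if is_dup else 1.0 - d.p_match
        yield world, prob

if __name__ == "__main__":
    decisions = [
        Decision(("r1", "r2"), 0.95),  # clear duplicate: decided deterministically
        Decision(("r3", "r4"), 0.05),  # clear non-duplicate: decided deterministically
        Decision(("r5", "r6"), 0.55),  # uncertain: kept indeterministic
    ]
    certain, uncertain = split_decisions(decisions)
    for world, prob in possible_worlds(uncertain):
        print(world, round(prob, 2))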

    A Multi-Layer Graphical Model for Approximate Identity Matching

    Many organizations maintain identity information for their customers, vendors, and employees. However, compromised identities often cannot be retrieved effectively. In this paper we first present a case study on identity problems existing in a local police department. The study shows that more than half of the sampled suspects have altered identities in the police information system due to deception and errors. We build a taxonomy of identity problems based on our findings. The decision to determine matching identities involves some uncertainty because of the problems identified. We propose a probability-based multi-layer graphical model to capture the uncertainty. Experiments show that the proposed model performs significantly better than a searching technique based on exact match. With 20% of training data labeled, the model with semi-supervised learning achieved performance comparable to that of fully supervised learning.
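
    The paper's multi-layer graphical model is not reproduced here; as a rough stand-in, the sketch below merely combines per-attribute similarity scores into a single match probability with a logistic function, with all attribute names, weights, and the bias invented for illustration.

import math

def identity_match_probability(similarities, weights, bias=-3.0):
    """Combine per-attribute similarity scores in [0, 1] into an overall match
    probability via a logistic model (a simple stand-in, not the paper's model)."""
    score = bias + sum(weights[a] * s for a, s in similarities.items())
    return 1.0 / (1.0 + math.exp(-score))

if __name__ == "__main__":
    # attribute names, similarities and weights are invented for illustration
    sims = {"name": 0.9, "date_of_birth": 1.0, "id_number": 0.4, "address": 0.7}
    wts = {"name": 2.0, "date_of_birth": 1.5, "id_number": 3.0, "address": 1.0}
    print(round(identity_match_probability(sims, wts), 3))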

    Event-Driven Duplicate Detection: A Probability-based Approach

    The importance of probability-based approaches for duplicate detection has been recognized in both research and practice. However, existing approaches do not aim to consider the underlying real-world events resulting in duplicates (e.g., a relocation may lead to the storage of two records for the same customer, one from before and one from after the relocation). Duplicates resulting from real-world events exhibit specific characteristics. For instance, duplicates resulting from relocations tend to have significantly different attribute values for all address-related attributes. Hence, existing approaches focusing on high similarity with respect to attribute values are hardly able to identify possible duplicates resulting from such real-world events. To address this issue, we propose an approach for event-driven duplicate detection based on probability theory. Our approach assigns the probability of being a duplicate resulting from real-world events to each analysed pair of records while avoiding limiting assumptions (of existing approaches). We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analysing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform well-known state-of-the-art approaches for duplicate detection.
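
    The following toy calculation (assumed likelihood ratios and prior, not the paper's model or the insurer data) illustrates the core intuition: under a relocation hypothesis, low address similarity is treated as supporting evidence rather than as a reason to reject the pair.

def relocation_duplicate_probability(sim, prior=0.01):
    """Bayes-style score for the hypothesis 'same person who relocated'.
    sim maps attribute names to similarity values in [0, 1]."""
    # likelihood ratios: how much more likely each observation is under the
    # relocation-duplicate hypothesis than under 'two distinct persons'
    # (all numbers are assumed values for illustration)
    lr = 1.0
    lr *= 50.0 if sim["name"] > 0.9 else 0.5
    lr *= 20.0 if sim["date_of_birth"] > 0.9 else 0.3
    # a relocation *expects* the address to differ, so low address similarity
    # counts as evidence for, not against, the event-duplicate hypothesis
    lr *= 5.0 if sim["address"] < 0.3 else 0.8
    odds = prior / (1.0 - prior) * lr
    return odds / (1.0 + odds)

if __name__ == "__main__":
    pair = {"name": 0.95, "date_of_birth": 1.0, "address": 0.1}
    print(round(relocation_duplicate_probability(pair), 3))  # close to 1 for this pair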

    Quality and complexity measures for data linkage and deduplication

    Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures.
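
    The chapter's point that pair-space measures can deceive is easy to see numerically; the sketch below (with invented numbers) computes precision, recall, f-measure and accuracy over all record pair comparisons, and accuracy stays near one even when half of the true matches are missed.

def pair_space_measures(n_records, true_matches, tp, fp):
    """Quality measures over the space of all record pair comparisons."""
    total_pairs = n_records * (n_records - 1) // 2
    fn = true_matches - tp
    tn = total_pairs - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total_pairs
    return precision, recall, f_measure, accuracy

if __name__ == "__main__":
    # 100,000 records, 5,000 true duplicate pairs, a classifier that finds only
    # half of them and raises 500 false alarms (numbers invented for illustration)
    p, r, f, a = pair_space_measures(100_000, 5_000, tp=2_500, fp=500)
    print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f} accuracy={a:.6f}")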