9 research outputs found

    A benchmark study of clustering based record linkage methods

    Record linkage (or record matching) tries to identify the records in datasets which represent the same entity. These entities could be people or any other entity of interest. In this study, a benchmark of clustering algorithms used in record linkage was conducted. The interest stems from the fact that, with the rise of machine learning, record linkage has come to be treated as a classification problem with two classes: matched and unmatched pairs. The pairs to be compared are the entries in the dataset, with a possible reduction of comparisons to avoid quadratic complexity. A clustering benchmark is needed because such experiments typically assume that the experimenter has substantial training data for the classification procedure and can therefore proceed in a supervised fashion. However, this is usually not the case in real-life scenarios. For that reason, in this benchmarking study, three main clustering algorithms are applied to three datasets deliberately selected for their different characteristics.
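
    As a minimal illustration of the clustering view of record linkage described above (not the benchmark code from this study), the Python sketch below compares all record pairs, builds per-attribute similarity vectors, and clusters them into two groups standing in for matched and unmatched pairs; the toy records, attributes, and library choices are assumptions for illustration.

    ```python
    # Minimal sketch: unsupervised record linkage via clustering of pairwise
    # similarity vectors (illustrative only; not the benchmark from the study).
    from itertools import combinations
    from difflib import SequenceMatcher

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy records; real benchmarks would use blocking to avoid comparing all pairs.
    records = [
        {"name": "Jon Smith", "city": "Berlin"},
        {"name": "John Smith", "city": "Berlin"},
        {"name": "Mary Jones", "city": "Hamburg"},
    ]

    def similarity_vector(a, b):
        """Per-attribute string similarities for one candidate pair."""
        return [SequenceMatcher(None, a[k], b[k]).ratio() for k in a]

    pairs = list(combinations(range(len(records)), 2))
    X = np.array([similarity_vector(records[i], records[j]) for i, j in pairs])

    # Two clusters stand in for the "matched" and "unmatched" classes; the cluster
    # whose centroid has the higher mean similarity is read as the matches.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    match_cluster = int(np.argmax(km.cluster_centers_.mean(axis=1)))
    matches = [pairs[i] for i, label in enumerate(km.labels_) if label == match_cluster]
    print(matches)  # likely [(0, 1)] for these toy records
    ```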

    Event-Driven Duplicate Detection: A Probability-based Approach

    The importance of probability-based approaches for duplicate detection has been recognized in both research and practice. However, existing approaches do not aim to consider the underlying real-world events resulting in duplicates (e.g., that a relocation may lead to the storage of two records for the same customer, once before and after the relocation). Duplicates resulting from real-world events exhibit specific characteristics. For instance, duplicates resulting from relocations tend to have significantly different attribute values for all address-related attributes. Hence, existing approaches focusing on high similarity with respect to attribute values are hardly able to identify possible duplicates resulting from such real-world events. To address this issue, we propose an approach for event-driven duplicate detection based on probability theory. Our approach assigns the probability of being a duplicate resulting from real-world events to each analysed pair of records while avoiding limiting assumptions (of existing approaches). We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analysing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform well-known state-of-the-art approaches for duplicate detection.
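
    The following Python sketch illustrates the general idea of scoring a record pair under a real-world event hypothesis such as a relocation, where name-related attributes are expected to agree while address-related attributes are expected to differ. It is a naive-Bayes-style toy model with assumed attribute groups, probabilities, and prior, not the probability model proposed in the paper.

    ```python
    # Illustrative sketch: probability that a record pair is a duplicate caused by
    # a relocation event. Naive-Bayes-style toy model with assumed probabilities;
    # not the probability model proposed in the paper.
    from difflib import SequenceMatcher

    NAME_ATTRS = ["first_name", "last_name", "birth_date"]  # assumed attribute groups
    ADDRESS_ATTRS = ["street", "zip", "city"]

    def agrees(a, b, cutoff=0.9):
        return SequenceMatcher(None, a, b).ratio() > cutoff

    def relocation_duplicate_probability(r1, r2, prior=0.01):
        """Bayes' rule with independence assumptions and an assumed prior duplicate rate."""
        likelihood_dup, likelihood_nondup = 1.0, 1.0
        for attr in NAME_ATTRS:
            agree = agrees(r1[attr], r2[attr])
            likelihood_dup *= 0.95 if agree else 0.05      # names expected to agree
            likelihood_nondup *= 0.05 if agree else 0.95
        for attr in ADDRESS_ATTRS:
            agree = agrees(r1[attr], r2[attr])
            likelihood_dup *= 0.10 if agree else 0.90      # relocation: address changes
            likelihood_nondup *= 0.05 if agree else 0.95
        evidence = prior * likelihood_dup + (1 - prior) * likelihood_nondup
        return prior * likelihood_dup / evidence
    ```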

    Decision Support System Using Weighting Similarity Model for Constructing Ground-Truth Data Set

    This research aims to construct a ground-truth dataset for the entity-matching process used to detect duplicate records in a bibliographic database. The contribution of this research is the resulting dataset, which can be used as a reference for measuring and evaluating entity-matching models implemented in bibliographic databases. This aim was achieved by developing a decision support system in which experts act as decision makers in the bibliographic database field to construct ground-truth datasets. The model used in this decision support system weights similarity by comparing each attribute of the record pairs in the dataset. An expert who understands all characteristics of the research database can use the graphical user interface to evaluate and determine which record pairs meet the conditions, such as being duplicate records. This research produces a ground-truth dataset using the decision support system approach.
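
    A hypothetical sketch of the weighted-similarity scoring step that could feed such an expert review interface is shown below; the attribute weights, threshold, and field names are illustrative assumptions, not the model from the paper.

    ```python
    # Hypothetical sketch of a weighted-similarity scoring step feeding an expert
    # review screen; weights, threshold, and field names are assumptions.
    from difflib import SequenceMatcher

    WEIGHTS = {"title": 0.5, "authors": 0.3, "year": 0.2}  # assumed weights, sum to 1

    def weighted_similarity(rec_a, rec_b):
        """Weighted average of per-attribute string similarities."""
        return sum(
            w * SequenceMatcher(None, str(rec_a[attr]), str(rec_b[attr])).ratio()
            for attr, w in WEIGHTS.items()
        )

    def candidates_for_review(pairs, threshold=0.7):
        """Pairs scored above the threshold would be shown to the expert, whose
        accept/reject decisions become the ground-truth labels."""
        return [
            (a, b, score)
            for a, b in pairs
            if (score := weighted_similarity(a, b)) >= threshold
        ]
    ```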

    Unsupervised Duplicate Detection Using Sample Non-Duplicates

    The problem of identifying objects in databases that refer to the same real-world entity is known, among others, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, due to errors and missing data. Traditional scenarios for duplicate detection are data warehouses, which are populated from several data sources. Duplicate detection here is part of the data cleansing process to improve data quality for the data warehouse. More recently, the problem of duplicate detection has also arisen in application scenarios such as web portals, which offer users unified access to several data sources, or meta search engines, which distribute a search to several other resources and finally merge the individual results. In such scenarios no long and expensive data cleansing process can be carried out; instead, good duplicate estimations must be available directly. The most common approaches to duplicate detection use either rules or a weighted aggregation of similarity measures between the individual attributes of potential duplicates. However, choosing the appropriate rules, similarity functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. For this reason, these approaches entail significant costs. This thesis presents an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. To this end, a refinement of the classic Fellegi-Sunter model for record linkage is developed, which makes use of these distributions to iteratively remove clear non-duplicates from the set of potential duplicates. Alternatively, machine learning methods such as Support Vector Machines are also used and compared with the refined Fellegi-Sunter model. Additionally, the presented approach is not only able to align flat records but also makes use of related objects, which may significantly increase the alignment accuracy, depending on the application. Evaluations show that the approach supersedes other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.
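
    The following rough Python sketch illustrates the iterative idea described above: estimate per-attribute agreement rates among potential duplicates and among sampled non-duplicates, score pairs with Fellegi-Sunter-style log-likelihood weights, and repeatedly drop clear non-duplicates before re-estimating. Cutoffs, thresholds, and helper names are assumptions for illustration, not the refined model developed in the thesis.

    ```python
    # Rough sketch of the iterative refinement idea: u-probabilities come from a
    # sample of non-duplicates, m-probabilities from the remaining potential
    # duplicates, and pairs with a low Fellegi-Sunter-style weight are dropped.
    # Cutoffs and helper names are assumptions, not the thesis' refined model.
    import math
    from difflib import SequenceMatcher

    def agreement_vector(a, b, attrs, cutoff=0.9):
        return [SequenceMatcher(None, a[k], b[k]).ratio() > cutoff for k in attrs]

    def agreement_rates(vectors, n_attrs, floor=1e-3):
        """Per-attribute agreement frequencies, clamped away from 0 and 1."""
        n = max(len(vectors), 1)
        return [
            min(max(sum(v[i] for v in vectors) / n, floor), 1 - floor)
            for i in range(n_attrs)
        ]

    def refine(potential_pairs, sample_nondup_pairs, attrs, iterations=5, drop_below=0.0):
        # u: agreement rates among (almost certainly) non-duplicate sample pairs.
        u = agreement_rates(
            [agreement_vector(a, b, attrs) for a, b in sample_nondup_pairs], len(attrs)
        )
        for _ in range(iterations):
            vectors = [agreement_vector(a, b, attrs) for a, b in potential_pairs]
            # m: agreement rates among the current set of potential duplicates.
            m = agreement_rates(vectors, len(attrs))

            def weight(v):
                return sum(
                    math.log(m[i] / u[i]) if v[i] else math.log((1 - m[i]) / (1 - u[i]))
                    for i in range(len(attrs))
                )

            kept = [p for p, v in zip(potential_pairs, vectors) if weight(v) >= drop_below]
            if len(kept) == len(potential_pairs):  # no clear non-duplicates left to remove
                break
            potential_pairs = kept
        return potential_pairs
    ```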

    A framework for accurate, efficient private record linkage
