9 research outputs found

    A benchmark study of clustering based record linkage methods

    Record linkage (or record matching) tries to identify the records in datasets which represent the same entity. These entities could be people or any other entity of interest. In this study, a benchmark of clustering algorithms used in record linkage was conducted. The interest stems from the fact that, with the rise of machine learning, record linkage has come to be treated as a classification problem with two classes: matched and unmatched pairs. The pairs to be compared are the entries in the dataset, with a possible reduction of comparisons to avoid quadratic complexity. A clustering benchmark is needed because such experiments typically assume that the experimenter has substantial training data for the classification procedure and can therefore proceed in a supervised fashion. However, this is usually not the case in real-life scenarios. For that reason, in this benchmarking study, three main clustering algorithms are applied to three datasets deliberately selected for their different characteristics.
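
    As a minimal illustration of the clustering view of record linkage described above (not the benchmark code from this study), the Python sketch below compares all record pairs, builds per-attribute similarity vectors, and clusters them into two groups standing in for matched and unmatched pairs; the toy records, attributes, and library choices are assumptions for illustration.

    ```python
    # Minimal sketch: unsupervised record linkage via clustering of pairwise
    # similarity vectors (illustrative only; not the benchmark from the study).
    from itertools import combinations
    from difflib import SequenceMatcher

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy records; real benchmarks would use blocking to avoid comparing all pairs.
    records = [
        {"name": "Jon Smith", "city": "Berlin"},
        {"name": "John Smith", "city": "Berlin"},
        {"name": "Mary Jones", "city": "Hamburg"},
    ]

    def similarity_vector(a, b):
        """Per-attribute string similarities for one candidate pair."""
        return [SequenceMatcher(None, a[k], b[k]).ratio() for k in a]

    pairs = list(combinations(range(len(records)), 2))
    X = np.array([similarity_vector(records[i], records[j]) for i, j in pairs])

    # Two clusters stand in for the "matched" and "unmatched" classes; the cluster
    # whose centroid has the higher mean similarity is read as the matches.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    match_cluster = int(np.argmax(km.cluster_centers_.mean(axis=1)))
    matches = [pairs[i] for i, label in enumerate(km.labels_) if label == match_cluster]
    print(matches)  # likely [(0, 1)] for these toy records
    ```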

    Event-Driven Duplicate Detection: A Probability-based Approach

    The importance of probability-based approaches for duplicate detection has been recognized in both research and practice. However, existing approaches do not aim to consider the underlying real-world events resulting in duplicates (e.g., that a relocation may lead to the storage of two records for the same customer, once before and after the relocation). Duplicates resulting from real-world events exhibit specific characteristics. For instance, duplicates resulting from relocations tend to have significantly different attribute values for all address-related attributes. Hence, existing approaches focusing on high similarity with respect to attribute values are hardly able to identify possible duplicates resulting from such real-world events. To address this issue, we propose an approach for event-driven duplicate detection based on probability theory. Our approach assigns the probability of being a duplicate resulting from real-world events to each analysed pair of records while avoiding limiting assumptions (of existing approaches). We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analysing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform well-known state-of-the-art approaches for duplicate detection.
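
    The following Python sketch illustrates the general idea of scoring a record pair under a real-world event hypothesis such as a relocation, where name-related attributes are expected to agree while address-related attributes are expected to differ. It is a naive-Bayes-style toy model with assumed attribute groups, probabilities, and prior, not the probability model proposed in the paper.

    ```python
    # Illustrative sketch: probability that a record pair is a duplicate caused by
    # a relocation event. Naive-Bayes-style toy model with assumed probabilities;
    # not the probability model proposed in the paper.
    from difflib import SequenceMatcher

    NAME_ATTRS = ["first_name", "last_name", "birth_date"]  # assumed attribute groups
    ADDRESS_ATTRS = ["street", "zip", "city"]

    def agrees(a, b, cutoff=0.9):
        return SequenceMatcher(None, a, b).ratio() > cutoff

    def relocation_duplicate_probability(r1, r2, prior=0.01):
        """Bayes' rule with independence assumptions and an assumed prior duplicate rate."""
        likelihood_dup, likelihood_nondup = 1.0, 1.0
        for attr in NAME_ATTRS:
            agree = agrees(r1[attr], r2[attr])
            likelihood_dup *= 0.95 if agree else 0.05      # names expected to agree
            likelihood_nondup *= 0.05 if agree else 0.95
        for attr in ADDRESS_ATTRS:
            agree = agrees(r1[attr], r2[attr])
            likelihood_dup *= 0.10 if agree else 0.90      # relocation: address changes
            likelihood_nondup *= 0.05 if agree else 0.95
        evidence = prior * likelihood_dup + (1 - prior) * likelihood_nondup
        return prior * likelihood_dup / evidence
    ```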

    Decision Support System Using Weighting Similarity Model for Constructing Ground-Truth Data Set

    This research aims to construct a ground-truth dataset for the entity-matching process used to detect duplicate records in a bibliographic database. The contribution of this research is the resulting dataset, which can be used as a reference for measuring and evaluating entity-matching models implemented in bibliographic databases. This aim was achieved by developing a decision support system in which experts act as decision makers in the bibliographic database field to construct ground-truth datasets. The model used in this decision support system weights similarity by comparing each attribute of the record pairs in the dataset. An expert who understands all characteristics of the research database can use the graphical user interface to evaluate and determine which record pairs meet the conditions, such as being duplicate records. This research produces a ground-truth dataset using the decision support system approach.
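
    A hypothetical sketch of the weighted-similarity scoring step that could feed such an expert review interface is shown below; the attribute weights, threshold, and field names are illustrative assumptions, not the model from the paper.

    ```python
    # Hypothetical sketch of a weighted-similarity scoring step feeding an expert
    # review screen; weights, threshold, and field names are assumptions.
    from difflib import SequenceMatcher

    WEIGHTS = {"title": 0.5, "authors": 0.3, "year": 0.2}  # assumed weights, sum to 1

    def weighted_similarity(rec_a, rec_b):
        """Weighted average of per-attribute string similarities."""
        return sum(
            w * SequenceMatcher(None, str(rec_a[attr]), str(rec_b[attr])).ratio()
            for attr, w in WEIGHTS.items()
        )

    def candidates_for_review(pairs, threshold=0.7):
        """Pairs scored above the threshold would be shown to the expert, whose
        accept/reject decisions become the ground-truth labels."""
        return [
            (a, b, score)
            for a, b in pairs
            if (score := weighted_similarity(a, b)) >= threshold
        ]
    ```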

    Unsupervised Duplicate Detection Using Sample Non-Duplicates

    The problem of identifying objects in databases that refer to the same real-world entity is known, among others, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, due to errors and missing data. Traditional scenarios for duplicate detection are data warehouses, which are populated from several data sources. Duplicate detection here is part of the data cleansing process to improve data quality for the data warehouse. More recently, the problem of duplicate detection has also arisen in application scenarios such as web portals, which offer users unified access to several data sources, or meta search engines, which distribute a search to several other resources and finally merge the individual results. In such scenarios no long and expensive data cleansing process can be carried out; instead, good duplicate estimations must be available directly. The most common approaches to duplicate detection use either rules or a weighted aggregation of similarity measures between the individual attributes of potential duplicates. However, choosing the appropriate rules, similarity functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. For this reason, these approaches entail significant costs. This thesis presents an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. To this end, a refinement of the classic Fellegi-Sunter model for record linkage is developed, which makes use of these distributions to iteratively remove clear non-duplicates from the set of potential duplicates. Alternatively, machine learning methods such as Support Vector Machines are also used and compared with the refined Fellegi-Sunter model. Additionally, the presented approach is not only able to align flat records but also makes use of related objects, which may significantly increase the alignment accuracy, depending on the application. Evaluations show that the approach supersedes other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.
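
    The following rough Python sketch illustrates the iterative idea described above: estimate per-attribute agreement rates among potential duplicates and among sampled non-duplicates, score pairs with Fellegi-Sunter-style log-likelihood weights, and repeatedly drop clear non-duplicates before re-estimating. Cutoffs, thresholds, and helper names are assumptions for illustration, not the refined model developed in the thesis.

    ```python
    # Rough sketch of the iterative refinement idea: u-probabilities come from a
    # sample of non-duplicates, m-probabilities from the remaining potential
    # duplicates, and pairs with a low Fellegi-Sunter-style weight are dropped.
    # Cutoffs and helper names are assumptions, not the thesis' refined model.
    import math
    from difflib import SequenceMatcher

    def agreement_vector(a, b, attrs, cutoff=0.9):
        return [SequenceMatcher(None, a[k], b[k]).ratio() > cutoff for k in attrs]

    def agreement_rates(vectors, n_attrs, floor=1e-3):
        """Per-attribute agreement frequencies, clamped away from 0 and 1."""
        n = max(len(vectors), 1)
        return [
            min(max(sum(v[i] for v in vectors) / n, floor), 1 - floor)
            for i in range(n_attrs)
        ]

    def refine(potential_pairs, sample_nondup_pairs, attrs, iterations=5, drop_below=0.0):
        # u: agreement rates among (almost certainly) non-duplicate sample pairs.
        u = agreement_rates(
            [agreement_vector(a, b, attrs) for a, b in sample_nondup_pairs], len(attrs)
        )
        for _ in range(iterations):
            vectors = [agreement_vector(a, b, attrs) for a, b in potential_pairs]
            # m: agreement rates among the current set of potential duplicates.
            m = agreement_rates(vectors, len(attrs))

            def weight(v):
                return sum(
                    math.log(m[i] / u[i]) if v[i] else math.log((1 - m[i]) / (1 - u[i]))
                    for i in range(len(attrs))
                )

            kept = [p for p, v in zip(potential_pairs, vectors) if weight(v) >= drop_below]
            if len(kept) == len(potential_pairs):  # no clear non-duplicates left to remove
                break
            potential_pairs = kept
        return potential_pairs
    ```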

    A framework for accurate, efficient private record linkage
