10 research outputs found

    A Progressive Technique for Duplicate Detection Evaluating Multiple Data Using Genetic Algorithm with Real World Objects

    Get PDF
    In this paper we present an analysis of progressive duplicate record detection over real-world data that contain two or more redundant representations of the same records in a database. Duplicate detection is the task of recognizing all instances of multiple representations of the same real-world objects, for example in customer relationship management or data mining. A representative case is customer relationship management, where an organization loses money by sending multiple catalogues to the same person, which also lowers customer satisfaction. Another application is data mining, where correct input data are needed to build the useful reports on which decisions are based. In this paper we study the progressive duplicate detection algorithm, supported by MapReduce, to identify duplicate data and delete the duplicate records.
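
    The abstract above names a progressive duplicate-detection algorithm implemented with MapReduce but does not spell the algorithm out. As a rough, hedged illustration only, the sketch below shows the general map/reduce pattern for duplicate detection (map each record to a blocking key, then compare record pairs within each block); the toy schema, the city-based blocking key and the 0.8 similarity threshold are assumptions, not details taken from the paper.

    from collections import defaultdict
    from difflib import SequenceMatcher

    # Toy records with a made-up schema; not data from the paper.
    records = [
        {"id": 1, "name": "John Smith", "city": "Boston"},
        {"id": 2, "name": "Jon Smith",  "city": "Boston"},
        {"id": 3, "name": "Mary Jones", "city": "Austin"},
    ]

    def map_phase(record):
        # Map step: emit (blocking key, record); here the key is simply the city.
        return record["city"].lower(), record

    def reduce_phase(block):
        # Reduce step: compare all record pairs that share the same blocking key.
        pairs = []
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                a, b = block[i], block[j]
                sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
                if sim > 0.8:                 # assumed duplicate threshold
                    pairs.append((a["id"], b["id"], round(sim, 2)))
        return pairs

    blocks = defaultdict(list)
    for rec in records:
        key, value = map_phase(rec)
        blocks[key].append(value)

    duplicates = [p for block in blocks.values() for p in reduce_phase(block)]
    print(duplicates)   # [(1, 2, 0.95)]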

    Record Duplication Detection in Database: A Review

    Get PDF
    The recognition of similar entities in databases has gained substantial attention in many application areas. Despite several techniques proposed to recognize and locate duplicate database records, there is a dearth of studies that rate the effectiveness of the diverse techniques used for duplicate record detection. Calculating the time complexity of the proposed methods reveals their performance ranking. The time complexity calculation showed that the efficiency of these methods improves when blocking and windowing are applied. Some domain-specific methods train systems to optimize results and improve efficiency and scalability, but they are prone to errors. Most existing methods either fail to discuss scalability or lack thoroughness in considering it. Sorting and searching form an essential part of duplicate detection, but they are time-consuming. This paper therefore proposes eliminating the sorting process by using a tree structure to improve record duplication detection. This has the added benefits of reducing the time required and offering a probable increase in scalability. For database systems, scalability is an essential feature of any proposed solution because data sizes are huge. Improving the efficiency of identifying duplicate records in databases is an essential step for data cleaning and data integration methods. This paper reveals that the currently proposed methods fail to provide solutions that are scalable, highly accurate, and reduce processing time when detecting duplicate records in a database. The ability to solve this problem will improve the quality of the data used in decision-making processes.
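
    The review above only proposes replacing the sort phase with a tree structure; the following minimal sketch is one way to read that idea, using a bisect-maintained sorted list as a stand-in for a balanced tree, so that records are inserted incrementally and compared only against nearby keys, with no global sort pass. The key scheme, window size and 0.85 threshold are illustrative assumptions, not details from the paper.

    import bisect
    from difflib import SequenceMatcher

    index_keys = []        # sorted blocking keys (stand-in for a search tree)
    index_records = {}     # key -> records sharing that key
    WINDOW = 3             # neighbouring keys to inspect on either side

    def insert_and_match(record, key):
        """Insert a record and return likely duplicates among neighbouring keys."""
        pos = bisect.bisect_left(index_keys, key)
        lo, hi = max(0, pos - WINDOW), min(len(index_keys), pos + WINDOW)
        candidates = [r for k in index_keys[lo:hi] for r in index_records[k]]
        matches = [c for c in candidates
                   if SequenceMatcher(None, c["name"], record["name"]).ratio() > 0.85]
        if key not in index_records:
            bisect.insort(index_keys, key)
            index_records[key] = []
        index_records[key].append(record)
        return matches

    print(insert_and_match({"name": "ann lee"}, "annlee"))     # []
    print(insert_and_match({"name": "anne lee"}, "annelee"))   # [{'name': 'ann lee'}]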

    FDDetector: A Tool for Deduplicating Features in Software Product Lines

    Get PDF
    Duplication is one of the model defects that affect software product lines during their evolution. Many approaches have been proposed to deal with duplication at the code level, while duplication in features has not received much attention in the literature. With the aim of reducing maintenance cost and improving product quality at an early stage of a product line, we proposed in previous work a tool support based on a conceptual framework. The main objective of this tool, called FDDetector, is to detect and correct duplication in product line models. In this paper, we recall the motivation behind creating a solution for feature deduplication and present the progress made in the design and implementation of FDDetector.

    MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities

    Get PDF
    Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of the entity descriptions published in the Web of Data. To address these challenges, we propose the MinoanER framework, which simultaneously provides full automation, support for highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but that it significantly outperforms them when matching KBs with high Variety. Comment: Presented at EDBT 2019
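
    MinoanER's full pipeline (composite blocking, disjunctive blocking graph, four matching rules) is considerably richer than anything shown here; the snippet below only illustrates schema-agnostic token blocking, one source of matching evidence the abstract mentions: any token shared between entity descriptions, regardless of attribute names, places them in a common block of candidate matches. The two toy knowledge bases are invented for the example.

    from collections import defaultdict
    from itertools import combinations

    # Attribute names deliberately differ across KBs: only the value tokens matter.
    kb1 = {"e1": {"label": "Stanley Kubrick", "born": "1928"},
           "e2": {"label": "Barry Lyndon", "year": "1975"}}
    kb2 = {"f1": {"name": "Kubrick, Stanley", "birthYear": "1928"},
           "f2": {"title": "A Clockwork Orange"}}

    def tokens(description):
        return {tok.lower().strip(",") for value in description.values()
                for tok in value.split()}

    # Token blocking: every shared token defines a block of candidate entities.
    blocks = defaultdict(set)
    for kb in (kb1, kb2):
        for eid, desc in kb.items():
            for tok in tokens(desc):
                blocks[tok].add(eid)

    candidate_pairs = {tuple(sorted(pair))
                       for block in blocks.values() if len(block) > 1
                       for pair in combinations(block, 2)}
    print(candidate_pairs)   # {('e1', 'f1')}: shared tokens 'kubrick', 'stanley', '1928'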

    Event-Driven Duplicate Detection: A Probability-based Approach

    Get PDF
    The importance of probability-based approaches for duplicate detection has been recognized in both research and practice. However, existing approaches do not aim to consider the underlying real-world events resulting in duplicates (e.g., that a relocation may lead to the storage of two records for the same customer, once before and once after the relocation). Duplicates resulting from real-world events exhibit specific characteristics. For instance, duplicates resulting from relocations tend to have significantly different attribute values for all address-related attributes. Hence, existing approaches focusing on high similarity with respect to attribute values are hardly able to identify possible duplicates resulting from such real-world events. To address this issue, we propose an approach for event-driven duplicate detection based on probability theory. Our approach assigns the probability of being a duplicate resulting from real-world events to each analysed pair of records while avoiding limiting assumptions (of existing approaches). We demonstrate the practical applicability and effectiveness of our approach in a real-world setting by analysing customer master data of a German insurer. The evaluation shows that the results provided by the approach are reliable and useful for decision support and can outperform well-known state-of-the-art approaches for duplicate detection.
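
    The paper's probabilistic model is not reproduced in the abstract, so the sketch below is only a rough, Bayes-style illustration of the idea: a pair whose names agree but whose address attributes differ is scored as a possible relocation duplicate rather than penalised for low overall similarity. The prior, the likelihoods and the record fields are invented values, not figures from the paper.

    from difflib import SequenceMatcher

    P_RELOCATION = 0.05           # assumed prior: a random pair stems from a relocation
    P_NAME_MATCH_IF_DUP = 0.95    # assumed: names agree given a relocation duplicate
    P_NAME_MATCH_IF_NOT = 0.001   # assumed: names agree given two unrelated customers

    def name_agrees(a, b, threshold=0.9):
        return SequenceMatcher(None, a, b).ratio() >= threshold

    def relocation_duplicate_probability(rec_a, rec_b):
        """Posterior that the pair is a relocation duplicate, given agreeing names."""
        if not name_agrees(rec_a["name"], rec_b["name"]):
            return 0.0
        # Differing addresses are expected under the relocation event, so they are
        # not penalised the way a plain attribute-similarity measure would do.
        p_if_dup = P_RELOCATION * P_NAME_MATCH_IF_DUP
        p_if_not = (1 - P_RELOCATION) * P_NAME_MATCH_IF_NOT
        return p_if_dup / (p_if_dup + p_if_not)

    a = {"name": "Julia Weber", "street": "Hauptstr. 1", "city": "Berlin"}
    b = {"name": "Julia Weber", "street": "Seeweg 9",    "city": "Hamburg"}
    print(round(relocation_duplicate_probability(a, b), 3))   # 0.98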

    A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

    Get PDF
    In data integration, entity resolution is an important technique for improving data quality. Existing research typically assumes that the target dataset contains only string-type data and uses a single similarity metric. For larger, high-dimensional datasets, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER-resolving method using a hybrid approach that includes type-based multiblocks, a varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the search space for entity pairs through constraints on redundant attributes and matching likelihood. We develop a reference implementation of our proposed approach and validate its performance using a real-life dataset from an Internet of Things project. We evaluate the data processing system using five standard metrics: effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach is a promising alternative for entity resolution and can feasibly be applied to real-world data cleaning for large datasets.
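
    The abstract sketches a hybrid of type-based blocks, varying window sizes and type-specific similarity metrics without giving formulas; as a hedged illustration under assumed details, the snippet below keys blocks on a string prefix plus a coarse numeric bucket and averages a per-type similarity over the attributes. The schema, the bucket width and the similarity functions are assumptions, not the paper's definitions.

    from difflib import SequenceMatcher

    def string_sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def numeric_sim(a, b, scale=10.0):
        return max(0.0, 1.0 - abs(float(a) - float(b)) / scale)

    SIM_BY_TYPE = {"string": string_sim, "numeric": numeric_sim}
    SCHEMA = {"device": "string", "reading": "numeric"}   # assumed IoT-style schema

    def blocking_key(record):
        # Type-aware key: a string prefix plus a coarse numeric bucket.
        return (record["device"][:4].lower(), round(float(record["reading"]) / 10))

    def record_similarity(a, b):
        # Average the type-appropriate similarity over all attributes.
        sims = [SIM_BY_TYPE[t](a[attr], b[attr]) for attr, t in SCHEMA.items()]
        return sum(sims) / len(sims)

    r1 = {"device": "sensor-17", "reading": "21.4"}
    r2 = {"device": "Sensor_17", "reading": "21.9"}
    if blocking_key(r1) == blocking_key(r2):           # same block, so compare
        print(round(record_similarity(r1, r2), 2))     # 0.92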

    Indexing techniques for real-time entity resolution

    No full text
    Entity resolution (ER), the process of identifying records in one or several data sets that refer to the same real-world entity, is an important task for improving data quality and for data integration. In general, unique entity identifiers are not available in real-world data sets. Therefore, identifying attributes such as names and addresses are required to perform the ER process using approximate matching techniques. Since many services in both the private and public sectors are moving online, organizations increasingly need to perform real-time ER (with sub-second response times) on query records that must be matched against existing data sets. Indexing is a major step in the ER process which aims to group similar records together using a blocking key criterion in order to reduce the search space. Most existing indexing techniques currently used for ER are static and can only be employed off-line with batch processing algorithms. A major aspect of achieving ER in real time is to develop novel, efficient and effective dynamic indexing techniques that allow dynamic updates and facilitate real-time matching. In this thesis, we focus on the indexing step in the context of real-time ER. We propose three dynamic indexing techniques and a blocking key learning algorithm to be used with real-time ER. The first index (named DySimII) is a blocking-based technique that is updated whenever a new query record arrives. We reduce the size of DySimII by proposing a frequency-filtered variant that only indexes the most frequent attribute values. The second index (named DySNI) is a tree-based dynamic indexing technique that is tailored for real-time ER. DySNI is based on the sorted neighborhood method that is commonly used in ER. We investigate several static and adaptive window approaches for retrieving candidate records. The third index (named F-DySNI) is a multi-tree technique that uses multiple distinct trees in the index data structure, where each tree has a unique sorting key. The aim of F-DySNI is to reduce the effect that errors and variations at the beginning of the attribute values used as sorting keys have on matching quality. Finally, we propose an unsupervised learning algorithm that automatically generates optimal blocking keys for building indexes that are adequate for real-time ER. We experimentally evaluate the proposed approaches using various real-world data sets with millions of records and synthetic data sets with different data characteristics. The results show that, as our indexing solutions grow in size, no appreciable increase occurs in either record insertion or query times. DySNI is the fastest among the proposed solutions, while F-DySNI achieves better matching quality. Compared to an existing indexing baseline, our proposed techniques achieve better query times and matching quality. Moreover, our blocking key learning algorithm achieves average query times that are around two orders of magnitude faster than an existing learning baseline while maintaining similar matching quality. Our proposed solutions are therefore shown to be suitable for real-time ER.
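
    The thesis abstract only names DySimII, DySNI and F-DySNI without describing their internals; the sketch below is a rough, stdlib-only reading of the multi-key idea behind F-DySNI, with bisect-maintained sorted lists standing in for the trees. The two sorting-key functions, the window size and the sample records are assumptions made for the example, not structures taken from the thesis.

    import bisect

    SORT_KEYS = {
        "forward":  lambda name: name.lower(),          # plain sorting key
        "reversed": lambda name: name.lower()[::-1],    # hedges errors at the start
    }
    indexes = {k: [] for k in SORT_KEYS}   # sorted lists of (sorting key, record id)
    WINDOW = 2                             # neighbours per side, per index

    def insert(record_id, name):
        for idx_name, key_fn in SORT_KEYS.items():
            bisect.insort(indexes[idx_name], (key_fn(name), record_id))

    def query(name):
        """Union of the window candidates retrieved from every sorted-key view."""
        candidates = set()
        for idx_name, key_fn in SORT_KEYS.items():
            entries = indexes[idx_name]
            pos = bisect.bisect_left(entries, (key_fn(name), ""))
            for _, rid in entries[max(0, pos - WINDOW): pos + WINDOW]:
                candidates.add(rid)
        return candidates

    insert("r1", "christine miller")
    insert("r2", "kristine miller")    # differs in its first character
    print(query("christine miller"))   # {'r1', 'r2'}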