
    Matching data detection for the integration system

    The purpose of data integration is to combine the multiple sources of heterogeneous data available on the internet, such as text, images, and video. After this stage, the data becomes large, so it must be organized to support efficient query execution. However, entity resolution problems arise, and different techniques are needed to analyze and verify data quality in order to achieve good data management. When a single database is involved, this mechanism is called deduplication. To solve the problems above, we propose in this article a method to compute the similarity between potentially duplicate records. The solution relies on graph-based techniques to narrow the search space for similar features; a composite mechanism is then used to locate the most similar records in the database, improving data quality so that good decisions can be made from heterogeneous sources.
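    A minimal sketch of the general approach this abstract outlines: blocking narrows the search space for similar records (standing in here for the graph-based step), and a composite similarity scores the surviving candidate pairs. The field names, weights, and threshold below are illustrative assumptions, not the authors' configuration.

        from collections import defaultdict
        from itertools import combinations

        def tokens(s):
            return set(s.lower().split())

        def jaccard(a, b):
            union = a | b
            return len(a & b) / len(union) if union else 0.0

        def candidate_pairs(records, field):
            # Blocking: only records sharing a token of `field` become candidates.
            blocks = defaultdict(list)
            for rid, rec in records.items():
                for tok in tokens(rec[field]):
                    blocks[tok].append(rid)
            pairs = set()
            for ids in blocks.values():
                pairs.update(combinations(sorted(ids), 2))
            return pairs

        def composite_similarity(r1, r2):
            # Composite mechanism: weighted mix of per-field similarities
            # (weights are illustrative).
            return (0.7 * jaccard(tokens(r1["name"]), tokens(r2["name"]))
                    + 0.3 * (1.0 if r1["city"] == r2["city"] else 0.0))

        records = {
            1: {"name": "Acme Data Systems", "city": "Rabat"},
            2: {"name": "ACME data systems", "city": "Rabat"},
            3: {"name": "Globex Media", "city": "Fes"},
        }
        for i, j in candidate_pairs(records, "name"):
            score = composite_similarity(records[i], records[j])
            if score >= 0.8:  # illustrative decision threshold
                print(f"potential duplicates: {i} and {j} (score {score:.2f})")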

    MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities

    Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume, and Veracity of entity descriptions published in the Web of Data. To address these challenges, we propose the MinoanER framework, which simultaneously offers full automation, support for highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as indicated solely by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but that it significantly outperforms them when matching KBs with high Variety.
    Comment: Presented at EDBT 2019.
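    A minimal sketch of the blocking-graph idea, under stated assumptions: schema-agnostic token blocking indexes every entity under every token of its attribute values, and each candidate pair becomes a graph edge weighted by the number of shared blocks. MinoanER's actual disjunctive blocking graph also folds in name and neighbor evidence; this illustrates only the content-based part.

        from collections import defaultdict
        from itertools import combinations

        def token_blocks(entities):
            # Schema-agnostic blocking: index each entity under every token
            # appearing in any of its attribute values.
            blocks = defaultdict(set)
            for eid, attrs in entities.items():
                for value in attrs.values():
                    for tok in str(value).lower().split():
                        blocks[tok].add(eid)
            return blocks

        def blocking_graph(blocks):
            # Edges = candidate pairs; edge weight = number of shared blocks
            # (a simple stand-in for MinoanER's composite edge weights).
            edges = defaultdict(int)
            for ids in blocks.values():
                for pair in combinations(sorted(ids), 2):
                    edges[pair] += 1
            return edges

        kb = {
            "e1": {"label": "Berlin", "country": "Germany"},
            "e2": {"name": "berlin city", "state": "Germany"},
            "e3": {"label": "Munich", "country": "Germany"},
        }
        for (a, b), w in sorted(blocking_graph(token_blocks(kb)).items()):
            print(a, b, "shared blocks:", w)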

    Efficient Pair-Wise Similarity Computation Using Apache Spark

    Entity matching is the process of identifying different manifestations of the same real-world entity. These entities can be referred to as objects (strings) or data instances. The entities are in turn split over several databases or clusters based on their signatures. When entity matching algorithms run on these databases or clusters, there is a high possibility that a particular entity pair is compared more than once: the number of comparisons for any two entities depends on the number of common signatures, or keys, they possess. This affects the performance of any entity matching algorithm. This paper is an implementation of the algorithm written by Erhard Rahm et al. for redundancy-free pair-wise similarity computation using MapReduce. As an improvement over the existing implementation, this project implements the algorithm in Apache Spark, in standalone mode for a sample of the data and in cluster mode for large volumes of data.
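    A minimal PySpark sketch of the redundancy-free idea described above: a record pair that shares several blocking keys is compared only in the block of its smallest common key, so each pair is evaluated exactly once. The record layout and keys below are toy assumptions, not the project's actual data.

        from itertools import combinations
        from pyspark import SparkContext

        sc = SparkContext("local[*]", "redundancy-free-similarity")

        # (record_id, sorted list of its blocking keys) -- illustrative data
        records = sc.parallelize([
            ("r1", ["k1", "k2"]),
            ("r2", ["k1", "k2", "k3"]),
            ("r3", ["k2"]),
        ])

        # Replicate each record into every block it belongs to,
        # carrying its full key list along.
        by_block = records.flatMap(
            lambda rk: [(key, (rk[0], rk[1])) for key in rk[1]]
        ).groupByKey()

        def unique_pairs(block):
            # Emit a pair only if this block's key is the smallest common key,
            # so every pair is produced exactly once across all blocks.
            key, members = block
            out = []
            for (r, rkeys), (s, skeys) in combinations(sorted(members), 2):
                common = set(rkeys) & set(skeys)
                if min(common) == key:
                    out.append((r, s))
            return out

        print(by_block.flatMap(unique_pairs).collect())
        # Each of (r1, r2), (r1, r3), (r2, r3) appears exactly once.
        sc.stop()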

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and it remains a challenging problem. While previous works have studied specific aspects of ER (mostly in traditional settings), in this survey we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods that cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps, and execution strategies that have been proposed by different communities, i.e., database, semantic Web, and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed, and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches and conclude with a detailed presentation of open research directions.
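    To make the workflow steps concrete, here is a minimal, self-contained sketch of the generic end-to-end pipeline such surveys describe: blocking to generate candidates, pair-wise matching, and clustering of matches into entities. Every function is a simplified stand-in, not any surveyed system's API.

        from collections import defaultdict
        from itertools import combinations

        def block(descriptions):
            # Step 1 -- schema-agnostic token blocking over attribute values.
            blocks = defaultdict(set)
            for did, desc in descriptions.items():
                for v in desc.values():
                    for tok in str(v).lower().split():
                        blocks[tok].add(did)
            return blocks

        def match(descriptions, blocks, threshold=0.5):
            # Step 2 -- Jaccard similarity over candidate pairs only.
            toks = {d: {t for v in desc.values() for t in str(v).lower().split()}
                    for d, desc in descriptions.items()}
            matches, seen = set(), set()
            for ids in blocks.values():
                for a, b in combinations(sorted(ids), 2):
                    if (a, b) in seen:
                        continue
                    seen.add((a, b))
                    u = toks[a] | toks[b]
                    if u and len(toks[a] & toks[b]) / len(u) >= threshold:
                        matches.add((a, b))
            return matches

        def cluster(ids, matches):
            # Step 3 -- connected components via union-find:
            # each component is one resolved entity.
            parent = {i: i for i in ids}
            def find(x):
                while parent[x] != x:
                    parent[x] = parent[parent[x]]
                    x = parent[x]
                return x
            for a, b in matches:
                parent[find(a)] = find(b)
            groups = defaultdict(set)
            for i in ids:
                groups[find(i)].add(i)
            return list(groups.values())

        data = {
            "d1": {"title": "iphone 13 128gb"},
            "d2": {"name": "apple iphone 13 128gb"},
            "d3": {"title": "galaxy s21"},
        }
        print(cluster(data, match(data, block(data))))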

    Data Matching and Deduplication Over Big Data Using Hadoop Framework

    Entity Resolution is the process of matching records from more than one database that refer to the same entity; when a single database is involved, the process is called deduplication. This article proposes a method to solve the entity resolution and deduplication problem using MapReduce over the Hadoop framework. The proposed method includes data preprocessing, comparison, and classification tasks, with indexing performed by the standard blocking method. Our method can operate on one, two, or more datasets and works with semi-structured or structured data.
    XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI).
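    A minimal Hadoop Streaming sketch of the standard blocking step (not the paper's code): the mapper emits a blocking key per record, Hadoop's shuffle groups and sorts records by key, and the reducer compares only records within the same block. The record layout (id<TAB>name), the prefix key, and the threshold are assumptions for illustration.

        # mapper.py -- emit (blocking_key, record) pairs
        import sys

        for line in sys.stdin:
            rec_id, name = line.rstrip("\n").split("\t", 1)
            key = name.lower()[:3]  # standard blocking key: name prefix
            print(f"{key}\t{rec_id}\t{name}")

        # reducer.py -- pair-wise comparison and classification inside each block
        # (Hadoop delivers reducer input sorted by key, so groupby is safe.)
        import sys
        from itertools import combinations, groupby

        def jaccard(a, b):
            u = a | b
            return len(a & b) / len(u) if u else 0.0

        rows = (line.rstrip("\n").split("\t", 2) for line in sys.stdin)
        for key, group in groupby(rows, key=lambda r: r[0]):
            block = [(rid, set(name.lower().split())) for _, rid, name in group]
            for (r1, t1), (r2, t2) in combinations(block, 2):
                if jaccard(t1, t2) >= 0.8:  # illustrative classification threshold
                    print(f"{r1}\t{r2}\tmatch")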
