    Cloud-Scale Entity Resolution: Current State and Open Challenges

    Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown dramatically over the past two decades, parallel techniques are called upon to meet the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art in parallel ER approaches. From this comprehensive overview, we then extract classification criteria for parallel ER and classify and compare the surveyed approaches according to these criteria. Finally, we identify open research questions and challenges and discuss potential solutions and directions for further research in this field.

    Load Balancing for Entity Matching over Big Data using Sorted Neighborhood

    Entity matching, also known as entity resolution, duplicate identification, reference reconciliation or record linkage, is a critically important task for data cleaning and data integration. It can be thought of as the task of finding records that refer to the same real-world entity. These records can belong to a single data source or to distributed data sources. Entity matching takes structured data as input and compares each structured record (entity or database record) with the entities present in a knowledge base. For large-scale entity matching, the data passes through a sequence of steps that includes evaluation, preprocessing, candidate calculation and classification. The entity-matching workflow consists of two strategies: blocking (map) and matching (reduce). Blocking divides a data source into partitions, or blocks, and improves performance by restricting comparisons to entities placed in the same block. Partitioning uses blocking keys, which are derived from an entity's attributes: the values of one or several attributes form the blocking key, most often as a concatenation of prefixes of these attribute values. The second part of the workflow is the matching strategy, which aims to identify all matching entity pairs within the same partition. To produce the matching result, the comparison result for each pair of entities must be computed. A matching strategy can use several matching approaches and can combine similarity scores to decide whether an entity pair is a match. The entity-matching model expects the matching strategy to return the list of matching entity pairs. Thus, by relating structured data to its most apposite entity, entity matching tries to gain the maximum out of the existing knowledge base. One well-known solution for entity matching is Dedoop [4] (Deduplication with Hadoop). The Cartesian product causes a heavy workload, since it executes with time complexity O(n^2); to leave more time for matching techniques that maintain quality, load balancing techniques are necessary. Even after blocking has been applied, the matching task can still be costly and can take several days to complete on large datasets. The MapReduce [2] programming model is well suited to executing entity matching in parallel. During execution, the input file is split into multiple parts, or chunks. In the map phase, multiple map tasks read those parts, which contain the entities, in parallel. In the reduce phase, the entities are redistributed among several reduce tasks based on their blocking keys. This groups together entities with the same blocking key and allows matching to be applied in parallel.
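    As a rough single-machine illustration of the blocking-and-matching workflow described above (not the Dedoop [4] implementation), the following Python sketch emulates the map and reduce phases: a map step derives a blocking key from attribute prefixes, a shuffle groups entities by that key, and a reduce step compares only pairs within the same block. The attribute names, prefix length and similarity threshold are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative entities; the attribute names "name" and "city" are assumptions.
entities = [
    {"id": 1, "name": "Alice Smith",    "city": "Berlin"},
    {"id": 2, "name": "Alice J. Smith", "city": "Berlin"},
    {"id": 3, "name": "Bob Jones",      "city": "Leipzig"},
    {"id": 4, "name": "Bob Jones",      "city": "Leipzig"},
]

def blocking_key(entity, prefix_len=3):
    """Map step: concatenate attribute-value prefixes to form the blocking key."""
    return (entity["name"][:prefix_len] + entity["city"][:prefix_len]).lower()

def similarity(a, b):
    """Jaccard similarity over name tokens; an assumed choice of matcher."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    return len(ta & tb) / len(ta | tb)

# "Shuffle": group entities by blocking key, as reduce tasks would receive them.
blocks = defaultdict(list)
for e in entities:
    blocks[blocking_key(e)].append(e)

# Reduce step: compare pairs only within a block; the 0.5 threshold is arbitrary.
matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        if similarity(a, b) >= 0.5:
            matches.append((a["id"], b["id"]))

print(matches)  # [(1, 2), (3, 4)]
```

    In a real MapReduce job the blocks would be distributed across reduce tasks, which is exactly where skewed block sizes create the load-balancing problem the paper addresses.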

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and it remains a challenging problem. While previous works have studied specific aspects of ER (mostly in traditional settings), this survey provides, for the first time, an end-to-end view of modern ER workflows and of the novel entity indexing and matching methods that cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies proposed by different communities, i.e., the database, Semantic Web and machine learning communities, to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches and conclude with a detailed presentation of open research directions.
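    The entity indexing (blocking) step that such surveys describe can be schema-agnostic when entity descriptions are loosely structured. As a minimal illustration of one such technique, token blocking, the sketch below places each description in one block per token occurring in any of its attribute values; the example records and attribute names are made up, and the code is not taken from the survey.

```python
from collections import defaultdict
from itertools import combinations

# Loosely structured entity descriptions; attribute names deliberately differ
# across sources. The records are made up for illustration.
descriptions = {
    "d1": {"name": "Stanley Kubrick", "born": "1928"},
    "d2": {"full_name": "Kubrick, Stanley", "year_of_birth": "1928"},
    "d3": {"title": "A Clockwork Orange", "director": "Kubrick"},
}

def tokens(description):
    """Schema-agnostic tokenization: ignore attribute names, split every value."""
    return {tok.strip(",.").lower()
            for value in description.values()
            for tok in value.split()}

# Token blocking: each token becomes a block key, so any two descriptions that
# share a token land in at least one common block.
blocks = defaultdict(set)
for desc_id, desc in descriptions.items():
    for tok in tokens(desc):
        blocks[tok].add(desc_id)

# Candidate pairs for matching are drawn only from blocks holding 2+ descriptions.
candidates = set()
for ids in blocks.values():
    if len(ids) > 1:
        candidates.update(combinations(sorted(ids), 2))

print(sorted(candidates))  # [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]
```

    Such a naive scheme generates many redundant comparisons, which is why additional block processing and pruning steps are typically applied on top before matching.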

    Scalable Data Integration for Linked Data

    Linked Data describes an extensive set of structured but heterogeneous data sources where entities are connected by formal semantic descriptions. In the vision of the Semantic Web, these semantic links are extended towards the World Wide Web to provide as much machine-readable data as possible for search queries. The resulting connections allow an automatic evaluation to find new insights into the data. Identifying these semantic connections between two data sources with automatic approaches is called link discovery. We derive common requirements and a generic link discovery workflow based on similarities between entity properties and associated properties of ontology concepts. Most of the existing link discovery approaches disregard the fact that in times of Big Data, an increasing volume of data sources poses new demands on link discovery. In particular, the problem of complex and time-consuming link determination escalates with an increasing number of intersecting data sources. To overcome the restriction of pairwise linking of entities, holistic clustering approaches are needed to link equivalent entities of multiple data sources and construct integrated knowledge bases. In this context, the focus on efficiency and scalability is essential. For example, reusing existing links or background information can help to avoid redundant calculations. However, when dealing with multiple data sources, additional data quality problems must also be dealt with. This dissertation addresses these comprehensive challenges by designing holistic linking and clustering approaches that enable the reuse of existing links. Unlike previous systems, we execute the complete data integration workflow via a distributed processing system. At first, the LinkLion portal is introduced to provide existing links for new applications. These links act as a basis for a physical data integration process that creates a unified representation for equivalent entities from many data sources. We then propose a holistic clustering approach to form consolidated clusters for the same real-world entities from many different sources. At the same time, we exploit the semantic type of entities to improve the quality of the result. The process identifies errors in existing links and can find numerous additional links. Additionally, the entity clustering has to react to the high dynamics of the data. In particular, this requires scalable approaches for continuously growing data sources with many entities as well as additional new sources. Previous entity clustering approaches are mostly static, focusing on the one-time linking and clustering of entities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that support the continuous addition of new entities and data sources. To cope with the ever-increasing number of Linked Data sources, efficient and scalable methods based on distributed processing systems are required. Thus, we propose distributed holistic approaches to link many data sources based on a clustering of entities that represent the same real-world object. The implementation is realized on Apache Flink. In contrast to previous approaches, we utilize efficiency-enhancing optimizations for both distributed static and dynamic clustering. An extensive comparative evaluation of the proposed approaches with various distributed clustering strategies shows high effectiveness for datasets from multiple domains as well as scalability on a multi-machine Apache Flink cluster.
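    To make the holistic and incremental clustering ideas concrete, here is a minimal single-machine sketch, not the dissertation's Apache Flink implementation: existing pairwise links between entities from multiple sources (e.g. reused from a link repository such as LinkLion) are merged into clusters with a union-find structure, merges are restricted to entities of the same semantic type, and a newly arriving entity is added incrementally by comparing it against existing entities of its type. The entity identifiers, labels, type check and similarity function are illustrative assumptions.

```python
from difflib import SequenceMatcher

class UnionFind:
    """Disjoint sets for merging linked entities into clusters."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Entities from several sources; identifiers, labels and types are made up.
entities = {
    "dbpedia:Leipzig":    {"label": "Leipzig",            "type": "City"},
    "geonames:2879139":   {"label": "Leipzig",            "type": "City"},
    "freebase:m.0dgkx":   {"label": "Leipzig, Germany",   "type": "City"},
    "dbpedia:LeipzigUni": {"label": "Leipzig University", "type": "Organisation"},
}

# Existing pairwise links, e.g. reused from a link repository such as LinkLion.
links = [("dbpedia:Leipzig", "geonames:2879139"),
         ("geonames:2879139", "freebase:m.0dgkx")]

uf = UnionFind()
for a, b in links:
    # Only merge entities of the same semantic type to avoid wrong clusters.
    if entities[a]["type"] == entities[b]["type"]:
        uf.union(a, b)

def add_incrementally(new_id, new_entity, threshold=0.8):
    """Link a newly arriving entity against existing entities of the same type."""
    entities[new_id] = new_entity
    for other_id, other in entities.items():
        if other_id == new_id or other["type"] != new_entity["type"]:
            continue
        sim = SequenceMatcher(None, new_entity["label"], other["label"]).ratio()
        if sim >= threshold:
            uf.union(new_id, other_id)

add_incrementally("wikidata:Q2079", {"label": "Leipzig", "type": "City"})

# Group entities by their cluster representative.
clusters = {}
for eid in entities:
    clusters.setdefault(uf.find(eid), []).append(eid)
print(clusters)
```

    The type restriction mirrors the abstract's point about exploiting the semantic type of entities to improve cluster quality; a distributed variant would partition the similarity computations and cluster merges across workers.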