5 research outputs found

    Distributed Holistic Clustering on Linked Data

    Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. Here we propose a distributed holistic approach that links many data sources by clustering entities that represent the same real-world object. Our clustering approach provides a compact and fused representation of entities, and can identify errors in existing links as well as many new links. We support distributed execution of the clustering approach to achieve faster execution times and scalability for large real-world data sets. We provide a novel gold standard for multi-source clustering, and evaluate our methods with respect to effectiveness and efficiency on large data sets from the geographic and music domains.

    LEAPME: learning-based property matching with embeddings

    Data integration tasks such as the creation and extension of knowledge graphs involve the fusion of heterogeneous entities from many sources. Matching and fusing such entities also requires matching and combining their properties (attributes). However, previous schema matching approaches mostly focus on only two sources and often rely on simple similarity measurements. They thus face problems in challenging use cases such as the integration of heterogeneous product entities from many sources. We therefore present a new machine learning-based property matching approach called LEAPME (LEArning-based Property Matching with Embeddings) that utilizes numerous features of both property names and instance values. The approach makes heavy use of word embeddings to better utilize the domain-specific semantics of both property names and instance values. The use of supervised machine learning helps exploit the predictive power of word embeddings. Our comparative evaluation against five baselines on several multi-source datasets with real-world data shows the high effectiveness of LEAPME. We also show that our approach remains effective even when training data from another domain is used (transfer learning).
    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2019-105471RB-I00; Junta de Andalucía P18-RT-106

    Instance-based Hierarchical Schema Alignment in Linked Data

    Doctoral dissertation, Seoul National University Graduate School: Department of Dental Science, Healthcare Management and Informatics major, August 2015. Hong-Gee Kim (김홍기).
    Along with the development of the Web of documents, there is a natural need for sharing, exchanging, and merging heterogeneous data to provide more comprehensive information and to answer users' more complex questions. However, the data published on the Web are raw dumps that sacrifice much of the semantics that can be used for exchanging and integrating data. Resource Description Framework (RDF) and Linked Data are designed to expose the semantics of data by interlinking data represented with well-defined relations. With the profusion of RDF resources and Linked Data, ontology alignment has gained significance in providing highly comprehensive knowledge embedded in disparate sources. Ontology alignment in Linking Open Data (LOD), however, has traditionally focused more on the instance level than on the schema level. Linked Data supports schema-level matching, provided that instance-level matching is already established. Linked Data is a hotbed for instance-based schema matching, which is considered a better solution for matching classes with ambiguous or obscure names. In this dissertation, the author focuses on three issues in instance-based schema alignment for Linked Data: (1) how to align schemas based on instances, (2) how to scale the schema alignment, (3) how to generate a hierarchical schema structure. Targeting the first issue, the author has proposed an instance-based schema alignment algorithm called IUT. The IUT builds a unified taxonomy for the classes from two ontologies based on an instance-class matrix and obtains the relations between two classes from their common instances. The author tested the IUT with DBpedia and YAGO2, and compared the IUT with two state-of-the-art methods in four alignment tasks. The experiments show that the IUT outperforms both methods in terms of efficiency and effectiveness (e.g., it takes 968 ms to obtain an F-score of 0.810 on intra-subsumption alignment in DBpedia).
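The idea of deriving class relations from common instances can be illustrated with a minimal sketch. This is not the IUT algorithm itself: the function name `class_relation`, the overlap-ratio test, and the default thresholds `chi_s` and `chi_e` are assumptions made for illustration, and the instance-class matrix is represented simply as one set of instance identifiers per class.

```python
from typing import Set


def class_relation(a: Set[str], b: Set[str],
                   chi_s: float = 0.9, chi_e: float = 0.9) -> str:
    """Guess the relation between classes A and B from their instance sets.

    chi_s and chi_e are hypothetical thresholds for subsumption and
    equivalence; the values here are illustrative, not from the source.
    """
    if not a or not b:
        return "unrelated"
    common = len(a & b)
    a_in_b = common / len(a)  # share of A's instances that also belong to B
    b_in_a = common / len(b)  # share of B's instances that also belong to A
    if a_in_b >= chi_e and b_in_a >= chi_e:
        return "equivalent"
    if a_in_b >= chi_s:
        return "A subClassOf B"
    if b_in_a >= chi_s:
        return "B subClassOf A"
    return "unrelated"


if __name__ == "__main__":
    # Toy data: each class maps to the set of instances typed with it.
    person = {"i1", "i2", "i3", "i4"}
    scientist = {"i1", "i2"}
    print(class_relation(scientist, person))
```

Comparing every pair of classes this way is quadratic in the number of classes, which is the cost that the scaled variant described next is meant to reduce.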
Targeting the second issue, the author has proposed a scaled version of the IUT called IUT(M). The IUT(M) reduces the computations of the IUT in two ways based on Locality Sensitive Hashing (LSH): (1) reducing the cost of the similarity computation for each pair of classes with MinHash functions, and (2) reducing the number of similarity computations with banding. The author tested the IUT(M) on the YAGO2-YAGO2 intra-subsumption alignment task to demonstrate that the running time of the IUT can be reduced by 94% with a 5% loss in F-score. Targeting the third issue, the author has proposed a method to generate a faceted taxonomy based on object properties on Linked Data. A framework is proposed to build a sub-taxonomy in each facet from sub-data, extracted with an object property, using an Instance-based Concept Taxonomy generation algorithm called ICT. Two experiments demonstrate: (1) The ICT efficiently and effectively generates a sub-taxonomy with rdf:type in DBpedia and YAGO2 (e.g., it takes 49 and 11,790 ms to build concept taxonomies that achieve Taxonomic F-scores of 0.917 and 0.780).
(2) The faceted taxonomies for Diseasome and DrugBank, efficiently generated based on multiple object properties (e.g., it takes 2,032 and 2,525 ms to build the faceted taxonomies based on 6 and 16 properties), can effectively reduce the search spaces in faceted searches (e.g., they obtain 1.65 and 1.03 on Maximum Resolution with 2 facets).

Contents:
1 Introduction
  1.1 Background and Motivations
    1.1.1 Data Integration and Schema Alignment
    1.1.2 From RDF to Linked Data
    1.1.3 Schema Alignment in Linked Data
  1.2 Instance-based Schema Alignment
  1.3 Contributions of this Dissertation
  1.4 Organization of this Dissertation
2 Preliminaries and Related Works
  2.1 Preliminaries
    2.1.1 RDF and Linked Data
    2.1.2 Ontology and Schema Alignment in Linked Data
  2.2 Related Works
    2.2.1 Instance-based Schema Alignment
    2.2.2 Scaling Pairwise Similarity Computations
    2.2.3 Automatic Taxonomy Generation
3 Aligning Schemas with Subsumption and Equivalence Relations
  3.1 Introduction
  3.2 Problem Definition
  3.3 Methods
    3.3.1 Workflow of Instance-based Schema Alignment
    3.3.2 Instance-class Matrix Generation
    3.3.3 Subsumption and Equivalence Relations Discovering
  3.4 Experiments
    3.4.1 Schema Alignment Algorithms in Comparison
    3.4.2 Data and Experiment Design
  3.5 Results
    3.5.1 Intra-subsumption Relations for YAGO2-YAGO2
    3.5.2 Intra-subsumption Relations for DBpedia-DBpedia
    3.5.3 Inter-Subsumption and Equivalence Relations for YAGO2-DBpedia
    3.5.4 Effects of χ_s and χ_e for the IUT
  3.6 Discussions
  3.7 Conclusion
4 Scaling Pair-wise Computations Using the Locality Sensitive Hashing
  4.1 Introduction
  4.2 Methods
    4.2.1 MinHash and Signatures
    4.2.2 Banding Technique
    4.2.3 Scaling the IUT with MinHash and Banding
  4.3 Experiment
  4.4 Discussions
  4.5 Conclusion
5 Unsupervised Hierarchical Schema Structure Generation in Linked Data
  5.1 Introduction
  5.2 Faceted Taxonomy for Linked Data
  5.3 Framework
    5.3.1 Facets Extraction
    5.3.2 Instance Restriction and Redundancy Removal
    5.3.3 Redundant Object Removal
    5.3.4 Instance-object Matrix Generation
  5.4 Generating Faceted Taxonomy
    5.4.1 The Problem of Generating a Sub-taxonomy for a Facet
    5.4.2 Concept Definition and Naming
    5.4.3 Taxonomy Generation Algorithm
    5.4.4 Instantiation and Taxonomy Refinement
  5.5 Experiments
    5.5.1 Task 1-Construction of Taxonomy with rdf:type
    5.5.2 Task 2-Construction of Multiple Faceted Taxonomies
  5.6 Results
    5.6.1 Results of Task 1
    5.6.2 Results of Task 2
  5.7 Discussion
  5.8 Conclusion
6 Future Works and Conclusion
  6.1 Future Works
    6.1.1 Similarity Measures for Instance-based Schema Alignment
    6.1.2 Ontology Evolution for Instance-based Schema Alignment
    6.1.3 Combining the IUT with Structure- and Lexical-based Methods
    6.1.4 Scaling the IUT with Parallel Computations
    6.1.5 Faceted Navigation and Search for Linked Data
  6.2 Conclusion
Bibliography
Abstract (Korean)
Docto
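
The two LSH ingredients named in the abstract above, MinHash signatures and banding, can be sketched generically as follows. This is a textbook-style illustration and not the IUT(M) implementation: the function names, the seeded BLAKE2 hash family, and the parameters `num_hashes` and `bands` are all assumptions. Each class is assumed to have a non-empty set of instance identifiers.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, item: str) -> int:
    """One member of a seeded hash family (an illustrative choice)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=32):
    """Signature = minimum hash value of the (non-empty) set per hash function."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=8):
    """Banding: only classes that share an identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    # Toy data: hypothetical classes mapped to their instance identifiers.
    classes = {
        "A": {f"i{n}" for n in range(50)},
        "B": {f"i{n}" for n in range(50)},
        "C": {f"j{n}" for n in range(50)},
    }
    sigs = {name: minhash_signature(inst) for name, inst in classes.items()}
    print(candidate_pairs(sigs))
```

Signatures shorten each pairwise comparison to a fixed number of integer comparisons, and banding limits the comparisons to classes that collide in at least one band, which matches the two kinds of savings the abstract lists. More bands with fewer rows each make collisions more likely, trading extra candidates for fewer missed pairs.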