22 research outputs found

    Learning Expressive Linkage Rules for Entity Matching using Genetic Programming

    A central problem in data integration and data cleansing is to identify pairs of entities in data sets that describe the same real-world object. Many existing methods for matching entities rely on explicit linkage rules, which specify how two entities are compared for equivalence. Unfortunately, writing accurate linkage rules by hand is a non-trivial problem that requires detailed knowledge of the data sets involved. Another important issue is the efficient execution of linkage rules. In this thesis, we propose a set of novel methods that cover the complete entity matching workflow, from the generation of linkage rules using genetic programming algorithms to their efficient execution on distributed systems. First, we propose GenLink, a supervised learning algorithm that generates linkage rules from a gold standard consisting of a set of entity pairs that have been labeled as duplicates or non-duplicates. We show that the algorithm outperforms previously proposed entity matching approaches, including the state-of-the-art genetic programming approach by de Carvalho et al., and that it learns linkage rules whose accuracy is similar to that of the human-written rule for the same problem. To also cover use cases for which no gold standard is available, we propose a complementary active learning algorithm that generates a gold standard interactively by asking the user to confirm or decline the equivalence of a small number of entity pairs. In the experimental evaluation, labeling at most 50 link candidates was sufficient to match the performance achieved by the supervised GenLink algorithm on the entire gold standard. Finally, we propose an efficient execution workflow that can be run on a cluster of multiple machines. The execution workflow employs a novel multidimensional indexing method that allows the efficient execution of learned linkage rules by significantly reducing the number of required comparisons.

    Learning Linkage Rules using Genetic Programming

    An important problem in Linked Data is the discovery of links between entities that identify the same real-world object. These links are often generated from manually written linkage rules, which specify the conditions that two entities must fulfill in order to be interlinked. In this paper, we present an approach to automatically generate linkage rules from a set of reference links. Our approach is based on genetic programming and has been implemented in the Silk Link Discovery Framework. It is capable of generating complex linkage rules that compare multiple properties of the entities and employ data transformations to normalize their values. Experimental results show that it outperforms a genetic programming approach for record deduplication recently presented by Carvalho et al. In tests with linkage rules that had been created for our research projects, our approach learned rules whose accuracy is similar to that of the original human-created linkage rules.

    Learning Expressive Linkage Rules using Genetic Programming

    A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules, which specify the conditions that entities must fulfill in order to be considered to describe the same real-world object. In this paper, we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming. The algorithm is capable of generating linkage rules that select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions. Our experiments show that the GenLink algorithm outperforms the state-of-the-art genetic programming approach to learning linkage rules recently presented by Carvalho et al. and is capable of learning linkage rules that achieve an accuracy similar to that of human-written rules for the same problem.
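
    The building blocks named in this abstract (property selection, transformation chains, distance measures with thresholds, and a non-linear aggregation) can be illustrated with a small hand-written rule. The following is only a sketch under assumptions: the class name `Comparison`, the linear threshold scoring, the use of the minimum as the aggregation, and the `accept` cutoff are all hypothetical choices for illustration and are not taken from GenLink or Silk. It shows the shape of a rule that such an algorithm would have to learn, not the learning procedure itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Entity = Dict[str, str]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class Comparison:
    """One comparison of a single property (hypothetical structure)."""
    prop: str                                # property selected for comparison
    transforms: List[Callable[[str], str]]   # chain applied to both values
    distance: Callable[[str, str], float]    # distance measure
    threshold: float                         # must be > 0

    def score(self, a: Entity, b: Entity) -> float:
        va, vb = a[self.prop], b[self.prop]
        for transform in self.transforms:
            va, vb = transform(va), transform(vb)
        d = self.distance(va, vb)
        # Assumed scoring: 1.0 at distance 0, falling linearly to 0.0
        # at the threshold and staying at 0.0 beyond it.
        return max(0.0, 1.0 - d / self.threshold)


def matches(rule: List[Comparison], a: Entity, b: Entity, accept: float = 0.5) -> bool:
    # Minimum as a simple non-linear aggregation: every comparison must agree.
    return min(c.score(a, b) for c in rule) >= accept


rule = [
    Comparison("name", [str.strip, str.lower], levenshtein, threshold=2.0),
    Comparison("year", [str.strip], lambda x, y: abs(float(x) - float(y)), threshold=2.0),
]
```

    With this rule, two records whose names differ only in case and surrounding whitespace and whose years are equal are treated as a match, while records with clearly different names are not.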

    Silk - Generating RDF links while publishing or consuming linked data

    The central idea of the Web of Data is to interlink data items using RDF links. In practice, however, most data sources are not sufficiently interlinked with related data sources. The Silk Link Discovery Framework addresses this problem by providing tools to generate links between data items based on user-provided link specifications. It can be used by data publishers to generate links between data sets as well as by Linked Data consumers to augment Web data with additional RDF links. In this poster, we present the Silk Link Discovery Framework and report on two usage examples in which we employed Silk to generate links between two data sets about movies and to find duplicate persons in a stream of data items crawled from the Web.

    Efficient multidimensional blocking for link discovery without losing recall

    Over the last three years, an increasing number of data providers have started to publish structured data on the Web according to the Linked Data principles. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools that scale to very large datasets. In record linkage, many partitioning methods have been proposed that substantially reduce the number of required entity comparisons. Unfortunately, most of these methods either lead to a decrease in recall or only work on metric spaces. We propose a novel blocking method called MultiBlock, which uses a multidimensional index in which similar objects are located near each other. In each dimension, the entities are indexed by a different property, which increases the efficiency of the index significantly. In addition, the method guarantees that no false dismissals can occur. Our approach works on complex link specifications that aggregate several different similarity measures. MultiBlock has been implemented as part of the Silk Link Discovery Framework. The evaluation shows a speedup by a factor of several hundred for large datasets compared to the full evaluation, without losing recall.
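
    The idea of a multidimensional index that places similar objects near each other without false dismissals can be illustrated with a plain grid index. The sketch below is not the MultiBlock method from the paper. It assumes numeric properties, one grid dimension per property, and a cell width equal to the distance threshold of that dimension; the function names are hypothetical. Under these assumptions, two values that differ by at most the threshold fall into the same or adjacent cells, so looking at the neighboring cells cannot miss a true match (floating-point rounding at exact cell boundaries aside).

```python
from collections import defaultdict
from itertools import product
from math import floor


def cell(point, widths):
    """Map a point to its grid cell, one coordinate per indexed property."""
    return tuple(floor(v / w) for v, w in zip(point, widths))


def build_index(entities, widths):
    """entities: dict of id -> tuple of numeric property values."""
    index = defaultdict(list)
    for entity_id, point in entities.items():
        index[cell(point, widths)].append(entity_id)
    return index


def candidates(point, index, widths):
    """Return ids from the cell of `point` and all adjacent cells.

    If |x - y| <= width in every dimension, the cell coordinates of x and y
    differ by at most 1, so every true match is among the candidates.
    """
    base = cell(point, widths)
    result = []
    for offset in product((-1, 0, 1), repeat=len(base)):
        key = tuple(b + o for b, o in zip(base, offset))
        result.extend(index.get(key, []))
    return result


entities = {"a": (10.0, 20.0), "b": (10.4, 20.3), "c": (50.0, 50.0)}
widths = (0.5, 0.5)  # assumed per-dimension distance thresholds
index = build_index(entities, widths)
nearby = candidates((10.1, 20.1), index, widths)
```

    Here `nearby` contains "a" and "b" but not "c", so only two of the three entities need a full comparison against the query point.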

    LDSpider: An open-source crawling framework for the Web of Linked Data

    The Web of Linked Data is growing and currently consists of several hundred interconnected data sources, altogether serving over 25 billion RDF triples to the Web. What has hampered the exploitation of this global dataspace until now is the lack of an open-source Linked Data crawler that Linked Data applications can employ to localize (parts of) the dataspace for further processing. With LDSpider, we are closing this gap in the landscape of publicly available Linked Data tools. LDSpider traverses the Web of Linked Data by following RDF links between data items. It supports different crawling strategies and allows crawled data to be stored either in files or in an RDF store.
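
    Traversing linked data by following links from a set of seed URIs can be sketched as a queue-based crawl. The code below is a minimal illustration and does not use or reproduce the LDSpider API. Breadth-first order is chosen here as one possible strategy, and `crawl`, `fetch_links`, and `max_uris` are hypothetical names. The link lookup is passed in as a function, so the example runs against an in-memory graph instead of the network.

```python
from collections import deque


def crawl(seeds, fetch_links, max_uris=100):
    """Visit URIs breadth-first, following the links each URI returns.

    fetch_links(uri) is assumed to return an iterable of target URIs,
    for example the objects of the RDF links found in that document.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicated, order preserved
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_uris:
        uri = queue.popleft()
        visited.append(uri)
        for target in fetch_links(uri):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# In-memory stand-in for the Web of Linked Data (made-up identifiers).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
order = crawl(["a"], lambda uri: graph.get(uri, []))
```

    Swapping the `deque` for a different frontier structure (for example a stack or a priority queue) would give a different crawling strategy with the same loop.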