5 research outputs found

    Scalable Data Integration for Linked Data

    Get PDF
    Linked Data describes an extensive set of structured but heterogeneous datasources where entities are connected by formal semantic descriptions. In thevision of the Semantic Web, these semantic links are extended towards theWorld Wide Web to provide as much machine-readable data as possible forsearch queries. The resulting connections allow an automatic evaluation to findnew insights into the data. Identifying these semantic connections betweentwo data sources with automatic approaches is called link discovery. We derivecommon requirements and a generic link discovery workflow based on similaritiesbetween entity properties and associated properties of ontology concepts. Mostof the existing link discovery approaches disregard the fact that in times ofBig Data, an increasing volume of data sources poses new demands on linkdiscovery. In particular, the problem of complex and time-consuming linkdetermination escalates with an increasing number of intersecting data sources.To overcome the restriction of pairwise linking of entities, holistic clusteringapproaches are needed to link equivalent entities of multiple data sources toconstruct integrated knowledge bases. In this context, the focus on efficiencyand scalability is essential. For example, reusing existing links or backgroundinformation can help to avoid redundant calculations. However, when dealingwith multiple data sources, additional data quality problems must also be dealtwith. This dissertation addresses these comprehensive challenges by designingholistic linking and clustering approaches that enable reuse of existing links.Unlike previous systems, we execute the complete data integration workflowvia a distributed processing system. At first, the LinkLion portal will beintroduced to provide existing links for new applications. These links act asa basis for a physical data integration process to create a unified representationfor equivalent entities from many data sources. We then propose a holisticclustering approach to form consolidated clusters for same real-world entitiesfrom many different sources. At the same time, we exploit the semantic typeof entities to improve the quality of the result. The process identifies errorsin existing links and can find numerous additional links. Additionally, theentity clustering has to react to the high dynamics of the data. In particular,this requires scalable approaches for continuously growing data sources withmany entities as well as additional new sources. Previous entity clusteringapproaches are mostly static, focusing on the one-time linking and clustering ofentities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of newentities and data sources. To cope with the ever-increasing number of LinkedData sources, efficient and scalable methods based on distributed processingsystems are required. Thus we propose distributed holistic approaches to linkmany data sources based on a clustering of entities that represent the samereal-world object. The implementation is realized on Apache Flink. In contrastto previous approaches, we utilize efficiency-enhancing optimizations for bothdistributed static and dynamic clustering. An extensive comparative evaluationof the proposed approaches with various distributed clustering strategies showshigh effectiveness for datasets from multiple domains as well as scalability on amulti-machine Apache Flink cluster

    A Hybrid Approach to Logic Evaluation

    Get PDF
    In this thesis, we contribute the hybrid approach – a means of combining the practical advantages of feature-rich logic evaluation in the cloud, with the performance benefits of hand-written, optimized, efficient native code. In the first part of our hybrid approach, we introduce a cloud-based distribution for logic programs, which may be deployed as a service, in standard cloud environments, across cheap commodity hardware. Modern systems are in the cloud; while distributed logic solvers exist, these systems are highly specialized, requiring expensive, resource intensive hardware infrastructures. Our original technique achieves a fully automatic synthesis of cloud infrastructure for logic programs, and includes a range of practical features not present in existing distributed logic solvers. We show that an implementation of the distribution scales effectively within real-world cloud environments, against a distribution over cores of the same machine. We show that our multi-node distribution may be effectively combined with existing multi-threaded techniques to mitigate the network communication cost incurred by distribution. In the second part of our hybrid approach, we introduce extra-logical algorithms, to achieve performance for logic programs that would not be possible within a bottom-up logic evaluation. Modern systems must deliver high performance on big data; however, even the most powerful logic engines, distributed or otherwise, can be beaten by hand-written code on particular problems. We give a novel implementation of a system for the high-impact problem of sink-reachability, designed such that its algorithms may be used in logic programs. A thorough empirical evaluation, across a range of large-scale, real-world datasets, shows our system outperforms the current state of the art for the sink-reachability problem in all cases. Our hybrid approach addresses the two major deficiencies of modern logic systems, providing a practical means of evaluating logic in distributed cloud-based environments, while offering performance gains for specific high-impact problems that would not be possible using logic programming alone
    corecore