131 research outputs found

    Towards efficient query processing over heterogeneous RDF interfaces

    Get PDF

    Distributed databases

    Get PDF
    Mòdul 3 del llibre Database Architecture. UOC, 20122022/202

    Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake

    Get PDF
    Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raise the need for effective data integration approaches able to process a large volume of data that is represented in different format, schema and model, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources in their original format, that reduce the overhead of materialized data integration. Query processing over Data Lakes require the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation to the Semantic Data Lake, but also for efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and find efficient execution plan that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model where the semantics encoded in data sources are ignored. Such descriptions may lead to the erroneous selection of data sources for a query and unnecessary retrieval of data, affecting thus the performance of query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describe knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describes data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide a uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MTs based source descriptions in order to express privacy and access control policies as well as their automatic enforcement during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but also are regulated by policies that allow for accessing these relevant entities. Finally, we tackle the problem of interest based update propagation and co-evolution of data sources. We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deal with co-evolution

    Decentralized Knowledge Graphs on the Web

    Get PDF

    Strategies for Managing Linked Enterprise Data

    Get PDF
    Data, information and knowledge become key assets of our 21st century economy. As a result, data and knowledge management become key tasks with regard to sustainable development and business success. Often, knowledge is not explicitly represented residing in the minds of people or scattered among a variety of data sources. Knowledge is inherently associated with semantics that conveys its meaning to a human or machine agent. The Linked Data concept facilitates the semantic integration of heterogeneous data sources. However, we still lack an effective knowledge integration strategy applicable to enterprise scenarios, which balances between large amounts of data stored in legacy information systems and data lakes as well as tailored domain specific ontologies that formally describe real-world concepts. In this thesis we investigate strategies for managing linked enterprise data analyzing how actionable knowledge can be derived from enterprise data leveraging knowledge graphs. Actionable knowledge provides valuable insights, supports decision makers with clear interpretable arguments, and keeps its inference processes explainable. The benefits of employing actionable knowledge and its coherent management strategy span from a holistic semantic representation layer of enterprise data, i.e., representing numerous data sources as one, consistent, and integrated knowledge source, to unified interaction mechanisms with other systems that are able to effectively and efficiently leverage such an actionable knowledge. Several challenges have to be addressed on different conceptual levels pursuing this goal, i.e., means for representing knowledge, semantic data integration of raw data sources and subsequent knowledge extraction, communication interfaces, and implementation. In order to tackle those challenges we present the concept of Enterprise Knowledge Graphs (EKGs), describe their characteristics and advantages compared to existing approaches. We study each challenge with regard to using EKGs and demonstrate their efficiency. In particular, EKGs are able to reduce the semantic data integration effort when processing large-scale heterogeneous datasets. Then, having built a consistent logical integration layer with heterogeneity behind the scenes, EKGs unify query processing and enable effective communication interfaces for other enterprise systems. The achieved results allow us to conclude that strategies for managing linked enterprise data based on EKGs exhibit reasonable performance, comply with enterprise requirements, and ensure integrated data and knowledge management throughout its life cycle

    Query Optimization Techniques For Scaling Up To Data Variety

    Get PDF
    Even though Data Lakes are efficient in terms of data storage, they increase the complexity of query processing; this can lead to expensive query execution. Hence, novel techniques for generating query execution plans are demanded. Those techniques have to be able to exploit the main characteristics of Data Lakes. Ontario is a federated query engine capable of processing queries over heterogeneous data sources. Ontario uses source descriptions based on RDF Molecule Templates, i.e., an abstract description of the properties belonging to the entities in the unified schema of the data in the Data Lake. This thesis proposes new heuristics tailored to the problem of query processing over heterogeneous data sources including heuristics specifically designed for certain data models. The proposed heuristics are integrated into the Ontario query optimizer. Ontario is compared to state-of-the-art RDF query engines in order to study the overhead introduced by considering heterogeneity during query processing. The results of the empirical evaluation suggest that there is no significant overhead when considering heterogeneity. Furthermore, the baseline version of Ontario is compared to two different sets of additional heuristics, i.e., heuristics specifically designed for certain data models and heuristics that do not consider the data model. The analysis of the obtained experimental results shows that source-specific heuristics are able to improve query performance. Ontario optimization techniques are able to generate effective and efficient query plans that can be executed over heterogeneous data sources in a Data Lake

    Integration of Legacy and Heterogeneous Databases

    Get PDF

    Integration of Legacy and Heterogeneous Databases

    Get PDF

    Incentive-driven QoS in peer-to-peer overlays

    Get PDF
    A well known problem in peer-to-peer overlays is that no single entity has control over the software, hardware and configuration of peers. Thus, each peer can selfishly adapt its behaviour to maximise its benefit from the overlay. This thesis is concerned with the modelling and design of incentive mechanisms for QoS-overlays: resource allocation protocols that provide strategic peers with participation incentives, while at the same time optimising the performance of the peer-to-peer distribution overlay. The contributions of this thesis are as follows. First, we present PledgeRoute, a novel contribution accounting system that can be used, along with a set of reciprocity policies, as an incentive mechanism to encourage peers to contribute resources even when users are not actively consuming overlay services. This mechanism uses a decentralised credit network, is resilient to sybil attacks, and allows peers to achieve time and space deferred contribution reciprocity. Then, we present a novel, QoS-aware resource allocation model based on Vickrey auctions that uses PledgeRoute as a substrate. It acts as an incentive mechanism by providing efficient overlay construction, while at the same time allocating increasing service quality to those peers that contribute more to the network. The model is then applied to lagsensitive chunk swarming, and some of its properties are explored for different peer delay distributions. When considering QoS overlays deployed over the best-effort Internet, the quality received by a client cannot be adjudicated completely to either its serving peer or the intervening network between them. By drawing parallels between this situation and well-known hidden action situations in microeconomics, we propose a novel scheme to ensure adherence to advertised QoS levels. We then apply it to delay-sensitive chunk distribution overlays and present the optimal contract payments required, along with a method for QoS contract enforcement through reciprocative strategies. We also present a probabilistic model for application-layer delay as a function of the prevailing network conditions. Finally, we address the incentives of managed overlays, and the prediction of their behaviour. We propose two novel models of multihoming managed overlay incentives in which overlays can freely allocate their traffic flows between different ISPs. One is obtained by optimising an overlay utility function with desired properties, while the other is designed for data-driven least-squares fitting of the cross elasticity of demand. This last model is then used to solve for ISP profit maximisation

    Towards Interoperable Research Infrastructures for Environmental and Earth Sciences

    Get PDF
    This open access book summarises the latest developments on data management in the EU H2020 ENVRIplus project, which brought together more than 20 environmental and Earth science research infrastructures into a single community. It provides readers with a systematic overview of the common challenges faced by research infrastructures and how a ‘reference model guided’ engineering approach can be used to achieve greater interoperability among such infrastructures in the environmental and earth sciences. The 20 contributions in this book are structured in 5 parts on the design, development, deployment, operation and use of research infrastructures. Part one provides an overview of the state of the art of research infrastructure and relevant e-Infrastructure technologies, part two discusses the reference model guided engineering approach, the third part presents the software and tools developed for common data management challenges, the fourth part demonstrates the software via several use cases, and the last part discusses the sustainability and future directions
    • …
    corecore