Tracking Federated Queries in the Linked Data
Federated query engines allow data consumers to execute queries over the
federation of Linked Data (LD). However, as federated queries are decomposed
into potentially thousands of subqueries distributed among SPARQL endpoints,
data providers do not know the federated queries; they only know the subqueries they
process. Consequently, unlike warehousing approaches, LD data providers have no
access to secondary data. In this paper, we propose FETA (FEderated query
TrAcking), a query tracking algorithm that infers Basic Graph Patterns (BGPs)
processed by a federation from a shared log maintained by data providers.
Concurrent execution of thousands of subqueries generated by multiple federated
query engines makes the query tracking process challenging and uncertain.
Experiments with Anapsid show that FETA is able to extract BGPs which, even in
a worst-case scenario, contain the BGPs of the original queries.
The Odyssey Approach for Optimizing Federated SPARQL Queries
Answering queries over a federation of SPARQL endpoints requires combining
data from more than one data source. Optimizing queries in such scenarios is
particularly challenging not only because of (i) the large variety of possible
query execution plans that correctly answer the query but also because (ii)
there is only limited access to statistics about schema and instance data of
remote sources. To overcome these challenges, most federated query engines rely
on heuristics to reduce the space of possible query execution plans or on
dynamic programming strategies to produce optimal plans. Nevertheless, these
plans may still exhibit a high number of intermediate results or high execution
times because of heuristics and inaccurate cost estimations. In this paper, we
present Odyssey, an approach that uses statistics that allow for a more
accurate cost estimation for federated queries and therefore enables Odyssey to
produce better query execution plans. Our experimental results show that
Odyssey produces query execution plans that are better in terms of data
transfer and execution time than state-of-the-art optimizers. Our experiments
using the FedBench benchmark show execution time gains of at least 25 times on
average.
Comment: 16 pages, 10 figures
Context-Aware Information Retrieval for Enhanced Situation Awareness
In coalition forces, users are increasingly challenged by information overload and by the need to correlate information from heterogeneous sources. Users might need different pieces of information, ranging from information about a single building to the resolution strategy of a global conflict. Sometimes, the time, location and past history of information access can also shape the information needs of users. Information systems need to help users pull together data from disparate sources according to their expressed needs (as represented by system queries), as well as less specific criteria. Information consumers have varying roles, tasks/missions, goals and agendas, knowledge and background, and personal preferences. These factors can be used to shape both the execution of user queries and the form in which retrieved information is packaged. However, full automation of this daunting information aggregation and customization task is not possible with existing approaches. In this paper we present an infrastructure for context-aware information retrieval to enhance situation awareness. The infrastructure provides each user with a customized, mission-oriented system that gives access to the right information from heterogeneous sources in the context of a particular task, plan and/or mission. The approach rests on five intertwined fundamental concepts, namely Workflow, Context, Ontology, Profile and Information Aggregation. The exploitation of this knowledge, using appropriate domain ontologies, will make it feasible to provide contextual assistance in various ways to the work performed according to a user’s task-relevant information requirements. This paper formalizes these concepts and their interrelationships.
Federated Query Processing
Big data plays a relevant role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Semantic web technologies have also experienced great progress, and scientific communities and practitioners have contributed to the problem of big data management with ontological models, controlled vocabularies, linked datasets, data models, query languages, as well as tools for transforming big data into knowledge from which decisions can be made. Despite the significant impact of big data and semantic web technologies, we are entering into a new era where domains like genomics are projected to grow very rapidly in the next decade. In this next era, integrating big data demands novel and scalable tools for enabling not only big data ingestion and curation but also efficient large-scale exploration and discovery. Federated query processing techniques provide a solution to scale up to large volumes of data distributed across multiple data sources. Federated query processing techniques resort to source descriptions to identify relevant data sources for a query, as well as to find efficient execution plans that minimize the total execution time of a query and maximize the completeness of the answers. This chapter summarizes the main characteristics of a federated query engine, reviews the current state of the field, and outlines the problems that still remain open and represent grand challenges for the area
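The source-selection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a deliberately simple description model in which each source only advertises the predicates it contains; the names `TriplePattern`, `SourceDescription` and `select_sources`, the prefixed predicate strings and the endpoint URLs are all hypothetical and are not taken from the chapter or from any particular engine.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TriplePattern:
    """One triple pattern of a query; terms starting with '?' are variables."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SourceDescription:
    """Hypothetical description of a source: its endpoint and the predicates it holds."""
    endpoint: str
    predicates: set[str] = field(default_factory=set)


def is_variable(term: str) -> bool:
    return term.startswith("?")


def select_sources(
    patterns: list[TriplePattern], sources: list[SourceDescription]
) -> dict[TriplePattern, list[str]]:
    """Map each triple pattern to the endpoints that may contribute answers.

    A pattern with a variable predicate cannot be pruned with this description
    model, so every source stays relevant for it (an assumption of this sketch).
    """
    selection: dict[TriplePattern, list[str]] = {}
    for tp in patterns:
        if is_variable(tp.predicate):
            relevant = [s.endpoint for s in sources]
        else:
            relevant = [s.endpoint for s in sources if tp.predicate in s.predicates]
        selection[tp] = relevant
    return selection


if __name__ == "__main__":
    sources = [
        SourceDescription("http://example.org/a/sparql", {"foaf:name"}),
        SourceDescription("http://example.org/b/sparql", {"foaf:name", "dbo:birthPlace"}),
    ]
    query = [
        TriplePattern("?person", "foaf:name", "?name"),
        TriplePattern("?person", "dbo:birthPlace", "?city"),
    ]
    for tp, endpoints in select_sources(query, sources).items():
        print(tp.predicate, "->", endpoints)
```

A pattern for which no source is selected means that, under this description model, the query cannot be answered completely by the federation; finer-grained descriptions are one way to prune the selection further.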
Policies Composition Based on Data Usage Context
In federated query processing, different datasets can be queried simultaneously. Each dataset has different privacy policies attached, but which privacy policy will govern the usage of the query result? In this work we propose a mechanism, based on semantic web technologies, to compose privacy policies. The originality of our approach is that our composition rules are based on the data usage context and deduced implicit terms.
FETA: Federated QuEry TrAcking for Linked Data
Following the principles of Linked Data (LD), data providers are producing thousands of interlinked datasets in multiple domains including life science, government, social networking, media and publications. Federated query engines allow data consumers to query several datasets through a federation of SPARQL endpoints. However, data providers just receive subqueries resulting from the decomposition of the original federated query. Consequently, they do not know how their data are crossed with other datasets of the federation. In this paper, we propose FETA, a Federated quEry TrAcking system for LD. We consider that data providers collaborate by sharing their query logs. Then, from a federated log, FETA infers Basic Graph Patterns (BGPs) containing joined triple patterns, executed among endpoints. We evaluated FETA on logs produced by FedBench queries executed with the Anapsid and FedX federated query engines. Experiments show that FETA is able to infer BGPs of joined triple patterns with good precision and recall.
Responsible Knowledge Management in Energy Data Ecosystems
This paper analyzes the challenges and requirements of establishing energy data ecosystems (EDEs) as data-driven infrastructures that overcome the limitations of currently fragmented energy applications. It proposes a new data- and knowledge-driven approach for management and processing. This approach aims to extend the analytics services portfolio of various energy stakeholders and achieve two-way flows of electricity and information for optimized generation, distribution, and electricity consumption. The approach is based on semantic technologies to create knowledge-based systems that will aid machines in integrating and processing resources contextually and intelligently. Thus, a paradigm shift in the energy data value chain is proposed towards transparency and the responsible management of data and knowledge exchanged by the various stakeholders of an energy data space. The approach can contribute to innovative energy management and the adoption of new business models in future energy data spaces. © 2022 by the authors. Licensee MDPI, Basel, Switzerland
Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake
Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process a large volume of data that is represented in different formats, schemas and models, and which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation to the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and to find efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model where the semantics encoded in data sources are ignored.
Such descriptions may lead to the erroneous selection of data sources for a query and unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describes the knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies as well as their automatic enforcement during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but also are regulated by policies that allow for accessing these relevant entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources.
We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deals with co-evolution.