
    Dynamic Provenance for SPARQL Update

    While the Semantic Web can currently express provenance information using the W3C PROV standards, there is a "missing link" in connecting PROV to storing and querying dynamic changes to RDF graphs made through SPARQL. Solving this problem is a prerequisite for clear use cases such as building version control systems for RDF. While some provenance models and annotation techniques for storing and querying provenance data, originally developed with databases or workflows in mind, transfer readily to RDF and SPARQL, these techniques do not readily adapt to describing changes in dynamic RDF datasets over time. In this paper we explore how to adapt the dynamic copy-paste provenance model of Buneman et al. [2] to RDF datasets that change over time in response to SPARQL updates, how to represent the resulting provenance records themselves as RDF in a manner compatible with W3C PROV, and how the provenance information can be defined by reinterpreting SPARQL updates. The primary contribution of this paper is a semantic framework that enables the semantics of SPARQL Update to be used as the basis for a 'cut-and-paste' provenance model in a principled manner. Comment: Pre-publication version of ISWC 2014 paper.
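
    As a rough illustration of the general idea (not the paper's actual model), the sketch below uses rdflib with invented URIs to apply a SPARQL Update to a data graph and record that update as a PROV activity in a separate provenance graph, so the change history itself becomes queryable RDF.

```python
from datetime import datetime, timezone

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import PROV, RDF, XSD

data = Graph()
prov = Graph()  # provenance kept separately so it can be queried like any other RDF

# Apply a SPARQL Update to the data graph.
update = 'INSERT DATA { <http://example.org/alice> <http://example.org/knows> <http://example.org/bob> . }'
data.update(update)

# Record the update as a PROV activity that used the update request
# and generated the new state of the graph (URIs are illustrative).
activity = URIRef("http://example.org/update/1")
request = URIRef("http://example.org/update/1/request")
new_state = URIRef("http://example.org/graph/version/2")

prov.add((activity, RDF.type, PROV.Activity))
prov.add((activity, PROV.endedAtTime,
          Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)))
prov.add((request, RDF.type, PROV.Entity))
prov.add((request, PROV.value, Literal(update)))   # the raw update string
prov.add((activity, PROV.used, request))
prov.add((new_state, RDF.type, PROV.Entity))
prov.add((new_state, PROV.wasGeneratedBy, activity))

print(prov.serialize(format="turtle"))
```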

    Reasoning & Querying – State of the Art

    Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the Internet, where keyword search is used in many applications such as search engines, has familiarized casual users with keyword queries for retrieving information. Unlike this easy-to-use style of querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming to enable simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF.
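
    To make the contrast concrete, the toy sketch below (rdflib, with an invented example graph) places a keyword-style scan over RDF literals next to a structured SPARQL query that requires knowing the property names in advance.

```python
from rdflib import Graph, Literal

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:tim ex:name "Tim Berners-Lee" ; ex:invented ex:Web .
""", format="turtle")

# Keyword-style lookup: scan literals for the keyword, no schema knowledge needed.
hits = [s for s, p, o in g if isinstance(o, Literal) and "Tim" in str(o)]

# Structured lookup: the SPARQL query presumes knowledge of the ex:name property.
rows = list(g.query(
    'SELECT ?s WHERE { ?s <http://example.org/name> ?n . FILTER(CONTAINS(?n, "Tim")) }'))

print(hits, rows)
```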

    Four Lessons in Versatility or How Query Languages Adapt to the Web

    Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled new kinds of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: some providers choose XML, others RDF, still others JSON or OWL for their data, even in similar domains. This fracturing stifles innovation, as application builders have to cope not with one Web stack (e.g., XML technology) but with several, each of considerable complexity. With Xcerpt we have developed a rule- and pattern-based query language that aims to shield application builders from much of this complexity: in a single query language, XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply for querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a “Web of Data”.
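
    The evaluation idea mentioned for the tree-shaped case can be illustrated in isolation: the sketch below (plain Python on an invented toy tree, not Xcerpt's implementation) assigns pre/post-order intervals so that ancestor/descendant tests reduce to constant-time interval containment.

```python
# Pre/post-order interval labelling on a tree: a node u is an ancestor of v
# iff u's interval strictly contains v's, so reachability needs no traversal.

def label(tree, root):
    """tree: dict mapping node -> list of children. Returns node -> (start, end)."""
    intervals, counter = {}, 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        intervals[node] = (start, counter)
        counter += 1

    visit(root)
    return intervals

tree = {"a": ["b", "c"], "b": ["d"]}      # toy XML-like tree
iv = label(tree, "a")

def is_ancestor(u, v):
    return iv[u][0] < iv[v][0] and iv[v][1] < iv[u][1]

assert is_ancestor("a", "d") and not is_ancestor("b", "c")
```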

    Enabling Complex Semantic Queries to Bioinformatics Databases through Intuitive Search Over Data

    Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data already available publicly. However, the heterogeneity of the existing data sources still poses significant challenges for achieving interoperability among biological databases. Furthermore, merely solving the technical challenges of data integration, for example through the use of common data representation formats, leaves open the larger problem: the steep learning curve required for understanding the data model of each public source, as well as the technical language through which the sources can be queried and joined. As a consequence, most of the available biological data remain practically unexplored today. In this thesis, we address these problems jointly, by first introducing an ontology-based data integration solution in order to mitigate the data source heterogeneity problem. We illustrate, through the concrete example of Bgee, a gene expression data source, how relational databases can be exposed as virtual Resource Description Framework (RDF) graphs through relational-to-RDF mappings. This has the important advantage that the original data source can remain unmodified while still becoming interoperable with external RDF sources. We complement our methods with applied case studies designed to guide domain experts in formulating expressive federated queries targeting the integrated data across the domains of evolutionary relationships and gene expression. More precisely, we introduce two comparative analyses, first within the same domain (using orthology data from multiple, interoperable data sources) and second across domains, in order to study the relation between expression change and evolution rate following a duplication event. Finally, in order to bridge the semantic gap between users and data, we design and implement Bio-SODA, a question answering system over domain knowledge graphs that does not require training data for translating user questions to SPARQL. Bio-SODA uses a novel ranking approach that combines syntactic and semantic similarity, while also incorporating node centrality metrics to rank candidate matches for a given user question. Our results in testing Bio-SODA across several real-world databases that span multiple domains (both within and outside bioinformatics) show that it can answer complex, multi-fact queries, beyond the current state of the art in the more well-studied open-domain question answering.
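
    A greatly simplified version of the ranking idea, using only syntactic similarity plus node degree as a centrality proxy (the actual Bio-SODA ranking also incorporates semantic similarity and is not reproduced here), might look like this:

```python
from difflib import SequenceMatcher

# Toy knowledge-graph neighbourhood: candidate node -> degree (crude centrality proxy).
degree = {"Gene": 120, "GeneExpression": 45, "Genome": 80}

def rank_candidates(term, candidates, alpha=0.7):
    """Score = alpha * syntactic similarity + (1 - alpha) * normalised centrality."""
    max_deg = max(degree[c] for c in candidates)
    scored = []
    for c in candidates:
        syntactic = SequenceMatcher(None, term.lower(), c.lower()).ratio()
        centrality = degree[c] / max_deg
        scored.append((alpha * syntactic + (1 - alpha) * centrality, c))
    return sorted(scored, reverse=True)

print(rank_candidates("gene", ["Gene", "GeneExpression", "Genome"]))
```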

    Reconciling and Using Historical Person Registers as Linked Open Data in the AcademySampo Portal and Data Service

    This paper presents a method for extracting and reassembling a genealogical network automatically from a biographical register of historical people. The method is applied to a dataset of short textual biographies about all 28 000 Finnish and Swedish academic people educated in Finland in 1640–1899. The aim is to connect and disambiguate the relatives mentioned in the biographies in order to build a continuous genealogical network, which can be used in Digital Humanities for data and network analysis of historical academic people and their lives. An artificial neural network approach is presented for solving a supervised learning task to disambiguate relatives mentioned in the register descriptions, using basic biographical information enhanced with an ontology of vocations and with additional, occasionally sparse, genealogical information. Evaluation results of the record linkage are promising and provide novel insights into the problem of reconciling registers of historical people. The outcome of the work has been used in practice as part of the in-use AcademySampo portal and linked open data service, a new member in the Sampo series of cultural heritage applications for Digital Humanities. Peer reviewed.
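
    The supervised disambiguation step could be framed roughly as below: a pairwise classifier (here scikit-learn's MLPClassifier on invented toy features, not the authors' actual network or feature set) scores whether a mentioned relative and a register entry refer to the same person.

```python
from sklearn.neural_network import MLPClassifier

# Each row encodes one (biography mention, candidate register entry) pair:
# [name similarity, scaled birth-year difference, vocation-ontology similarity]
X_train = [
    [0.95, 0.0, 1.0],   # same person
    [0.90, 0.1, 0.8],   # same person
    [0.40, 0.9, 0.1],   # different person
    [0.55, 0.7, 0.3],   # different person
]
y_train = [1, 1, 0, 0]

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# A new candidate pair is linked if the predicted match probability is high enough.
print(clf.predict_proba([[0.88, 0.05, 0.9]])[0][1])
```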

    Supporting Explainable AI on Semantic Constraint Validation

    A rising number of knowledge graphs is being published through various sources. This enormous amount of linked data gives entities a semantic context, and using SHACL, entities can be validated with respect to that context. At the same time, the increasing use of AI models in production systems brings great responsibility in various areas. Predictive models such as linear regression, logistic regression, and tree-based models are still frequently used because their simple structure allows for interpretability. However, explaining a model includes verifying whether it makes predictions consistent with human constraints or scientific facts. This work proposes to use the semantic context of entities in knowledge graphs to validate predictive models with respect to user-defined constraints, thereby providing a theoretical framework for a model-agnostic validation engine based on SHACL. In a second step, the model validation results are summarized for the case of a decision tree and visualized coherently with the model structure. Finally, the performance of the framework is evaluated based on a Python implementation.
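
    A minimal sketch of validating data against a user-defined constraint with SHACL, using pySHACL and an illustrative shape (not the framework described in the paper):

```python
from pyshacl import validate
from rdflib import Graph

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:pred1 a ex:Prediction ; ex:predictedAge -3 .
""", format="turtle")

# A user-defined constraint: predicted ages must be non-negative (illustrative shape).
shapes = Graph().parse(data="""
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
ex:PredictionShape a sh:NodeShape ;
    sh:targetClass ex:Prediction ;
    sh:property [ sh:path ex:predictedAge ; sh:minInclusive 0 ] .
""", format="turtle")

conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)       # False: the prediction violates the constraint
print(report_text)
```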

    GraphCache: A Caching System for Graph Queries

    Graph query processing is essential for graph analytics, but can be very time-consuming as it entails the NP-Complete problem of subgraph isomorphism. Traditionally, caching plays a key role in expediting query processing. We thus put forth GraphCache (GC), the first full-fledged caching system for general subgraph/supergraph queries. We contribute the overall system architecture and implementation of GC. We study a number of novel graph cache replacement policies and show that different policies win over different graph datasets and/or queries; we therefore contribute a novel hybrid graph replacement policy that is always the best or near-best performer. Moreover, we discover the related problem of cache pollution and propose a novel cache admission control mechanism to avoid cache pollution. Furthermore, we show that GC can be used as a front end, complementing any graph query processing method as a pluggable component. Currently, GC comes bundled with 3 top-performing filter-then-verify (FTV) subgraph query methods and 3 well-established direct subgraph-isomorphism (SI) algorithms - representing different categories of graph query processing research. Finally, we contribute a comprehensive performance evaluation of GC. We employ more than 6 million queries, generated using different workload generators, and executed against both real-world and synthetic graph datasets of different characteristics, quantifying the benefits and overheads, emphasizing the non-trivial lessons learned.
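
    As a toy stand-in for what such a cache does in front of an expensive subgraph-isomorphism routine, the sketch below implements simple LRU replacement with a crude cost-based admission check; GraphCache's actual hybrid replacement policy and admission control are more sophisticated and not reproduced here.

```python
from collections import OrderedDict

class QueryResultCache:
    """Tiny LRU cache sketch standing in for a graph query-result cache."""

    def __init__(self, capacity=2, min_cost=0.0):
        self.capacity = capacity
        self.min_cost = min_cost      # crude admission control: skip cheap queries
        self.entries = OrderedDict()  # query string -> result set

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as recently used
            return self.entries[query]
        return None                           # cache miss: run the real query method

    def put(self, query, result, cost):
        if cost < self.min_cost:              # not worth caching (pollution guard)
            return
        self.entries[query] = result
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = QueryResultCache(capacity=2, min_cost=0.5)
cache.put("subgraph:q1", {"g7", "g9"}, cost=1.2)
print(cache.get("subgraph:q1"))
```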

    Bridging the Gap Between Ontology and Lexicon via Class-Specific Association Rules Mined from a Loosely-Parallel Text-Data Corpus

    There is a well-known lexical gap between content expressed in the form of natural language (NL) texts and content stored in an RDF knowledge base (KB). For tasks such as Information Extraction (IE), this gap needs to be bridged from NL to KB, so that facts extracted from text can be represented in RDF and then added to an RDF KB. For tasks such as Natural Language Generation, this gap needs to be bridged from KB to NL, so that facts stored in an RDF KB can be verbalized and read by humans. In this paper we propose LexExMachina, a new methodology that induces correspondences between lexical elements and KB elements by mining class-specific association rules. As an example of such an association rule, consider the rule that predicts that if the text about a person contains the token "Greek", then this person has the relation nationality to the entity Greece. Another rule predicts that if the text about a settlement contains the token "Greek", then this settlement has the relation country to the entity Greece. Such a rule can help in question answering, as it maps an adjective to the relevant KB terms, and it can help in information extraction from text. We propose and empirically investigate a set of 20 types of class-specific association rules, together with different interestingness measures to rank them. We apply our method to a loosely parallel text-data corpus that consists of data from DBpedia and texts from Wikipedia, and evaluate and provide empirical evidence for the utility of the rules for Question Answering.
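
    A bare-bones version of mining such class-specific rules, assuming a toy corpus and using plain support and confidence as the interestingness measure (the paper investigates 20 rule types and several measures), could look like this:

```python
from collections import Counter

# Toy loosely-parallel corpus: (class, tokens in the entity's text, set of KB facts).
corpus = [
    ("Person",     {"greek", "philosopher"}, {("nationality", "Greece")}),
    ("Person",     {"greek", "poet"},        {("nationality", "Greece")}),
    ("Settlement", {"greek", "village"},     {("country", "Greece")}),
]

def mine_rules(corpus, min_conf=0.8):
    """Rules of the form: class + token => (relation, object), ranked by confidence."""
    token_count, rule_count = Counter(), Counter()
    for cls, tokens, facts in corpus:
        for tok in tokens:
            token_count[(cls, tok)] += 1
            for fact in facts:
                rule_count[(cls, tok, fact)] += 1
    rules = []
    for (cls, tok, fact), support in rule_count.items():
        confidence = support / token_count[(cls, tok)]
        if confidence >= min_conf:
            rules.append((cls, tok, fact, support, confidence))
    return sorted(rules, key=lambda r: (-r[4], -r[3]))

for rule in mine_rules(corpus):
    print(rule)
```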