Reproducible Domain-Specific Knowledge Graphs in the Life Sciences: a Systematic Literature Review
Knowledge graphs (KGs) are widely used for representing and organizing
structured knowledge in diverse domains. However, the creation and upkeep of
KGs pose substantial challenges. Developing a KG demands extensive expertise in
data modeling, ontology design, and data curation. Furthermore, KGs are
dynamic, requiring continuous updates and quality control to ensure accuracy
and relevance. These intricacies contribute to the considerable effort required
for their development and maintenance. One critical dimension of KGs that
warrants attention is reproducibility. The ability to replicate and validate
KGs is fundamental for ensuring the trustworthiness and sustainability of the
knowledge they represent. Reproducible KGs not only support open science by
allowing others to build upon existing knowledge but also enhance transparency
and reliability in disseminating information. Despite the growing number of
domain-specific KGs, a comprehensive analysis concerning their reproducibility
has been lacking. This paper addresses this gap by offering a general overview
of domain-specific KGs and comparing them based on various reproducibility
criteria. Our study, spanning 19 different domains, shows that only eight of
250 domain-specific KGs (3.2%) provide publicly available source code. Among
these, only one system passed our reproducibility assessment (14.3%).
These findings highlight the challenges and gaps in achieving reproducibility
across domain-specific KGs. Our finding that only 0.4% of published
domain-specific KGs are reproducible shows a clear need for further research
and a shift in cultural practices.
How and Why is An Answer (Still) Correct? Maintaining Provenance in Dynamic Knowledge Graphs
Knowledge graphs (KGs) have increasingly become the backbone of many critical
knowledge-centric applications. Most large-scale KGs used in practice are
automatically constructed based on an ensemble of extraction techniques applied
over diverse data sources. It is therefore important to establish the
provenance of a query's results in order to determine how they were computed.
Provenance is shown to be useful for assigning confidence scores to the
results, for debugging the KG generation itself, and for providing answer
explanations. In many such applications, certain queries are registered as
standing queries since their answers are needed often. However, KGs keep
continuously changing due to reasons such as changes in the source data,
improvements to the extraction techniques, refinement/enrichment of
information, and so on. This brings us to the issue of efficiently maintaining
the provenance polynomials of complex graph pattern queries for dynamic and
large KGs instead of having to recompute them from scratch each time the KG is
updated. Addressing these issues, we present HUKA which uses provenance
polynomials for tracking the derivation of query results over knowledge graphs
by encoding the edges involved in generating the answer. More importantly, HUKA
also maintains these provenance polynomials in the face of updates---insertions
as well as deletions of facts---to the underlying KG. Experimental results over
large real-world KGs such as YAGO and DBpedia, with various benchmark SPARQL
query workloads, reveal that HUKA can be almost 50 times faster than existing
systems for provenance computation on dynamic KGs.
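The idea of provenance polynomials described above can be sketched in a few lines: each base fact gets a variable, a join of facts multiplies variables into a monomial, and alternative derivations of the same answer are added together. The names and data below are illustrative only, not HUKA's actual code or API.

```python
# Each KG edge carries a provenance variable.
edges = {
    ("alice", "knows", "bob"): "p1",
    ("bob", "knows", "carol"): "p2",
    ("alice", "knows", "dave"): "p3",
}

def two_hop_provenance(src, dst):
    """Provenance polynomial for the pattern '?src knows ?m . ?m knows ?dst'.

    A polynomial is a set of monomials; a monomial is the set of
    provenance variables of the joined edges.
    """
    poly = set()
    for (s1, _, o1), v1 in edges.items():
        for (s2, _, o2), v2 in edges.items():
            if s1 == src and o1 == s2 and o2 == dst:
                poly.add(frozenset({v1, v2}))
    return poly

def delete_fact(poly, var):
    """Maintain a polynomial under deletion: drop every monomial
    (derivation) that used the deleted fact's variable, instead of
    recomputing the query from scratch."""
    return {m for m in poly if var not in m}

poly = two_hop_provenance("alice", "carol")
assert poly == {frozenset({"p1", "p2"})}   # one derivation: p1 * p2
assert delete_fact(poly, "p2") == set()    # deleting p2 kills the answer
```

This is only the deletion half of maintenance; insertions would symmetrically add new monomials formed by joining the new edge with existing ones.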
Provenance-aware knowledge representation: A survey of data models and contextualized knowledge graphs
Expressing machine-interpretable statements in the form of subject-predicate-object triples is a well-established practice for capturing the semantics of structured data. However, RDF, the standard used for representing these triples, inherently lacks a mechanism for attaching provenance data, which would be crucial for making automatically generated and/or processed data authoritative. This paper is a critical review of data models, annotation frameworks, knowledge organization systems, serialization syntaxes, and algebras that enable provenance-aware RDF statements. The various approaches are assessed in terms of standard compliance, formal semantics, tuple type, vocabulary term usage, blank nodes, provenance granularity, and scalability. This can be used to advance existing solutions and to help implementers select the most suitable approach (or combination of approaches) for their applications. Moreover, the analysis of the mechanisms and their limitations highlighted in this paper can serve as the basis for novel approaches in RDF-powered applications with increasing provenance needs.
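One family of approaches the survey covers extends triples to quads, where the fourth element (a named graph, per RDF 1.1 datasets) identifies the provenance context. The sketch below illustrates the idea with plain tuples; the identifiers and data are hypothetical, not drawn from any particular system.

```python
# A plain RDF triple has no slot for provenance metadata.
triple = ("ex:Berlin", "ex:population", "3700000")

# A quad adds a graph name that serves as the provenance context.
quads = [
    ("ex:Berlin", "ex:population", "3700000", "prov:extraction_run_42"),
    ("ex:Berlin", "ex:capitalOf", "ex:Germany", "prov:manual_curation"),
]

# Provenance statements can then be made about the context itself,
# e.g. using PROV-O style properties (values here are illustrative).
provenance = {
    "prov:extraction_run_42": {
        "prov:wasDerivedFrom": "ex:wikipedia_dump",
        "prov:generatedAtTime": "2023-01-15",
    },
}

def triples_from(context):
    """Return all statements attributed to a given provenance context."""
    return [(s, p, o) for s, p, o, g in quads if g == context]

assert triples_from("prov:extraction_run_42") == [triple]
```

Other mechanisms surveyed (reification, singleton properties, RDF-star) trade off granularity and standard compliance differently: named graphs annotate at the granularity of a whole graph, whereas triple-level mechanisms annotate individual statements.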
Expressing machine-interpretable statements in the form of subject-predicate-object triples is a well-established practice for capturing semantics of structured data. However, the standard used for representing these triples, RDF, inherently lacks the mechanism to attach provenance data, which would be crucial to make automatically generated and/or processed data authoritative. This paper is a critical review of data models, annotation frameworks, knowledge organization systems, serialization syntaxes, and algebras that enable provenance-aware RDF statements. The various approaches are assessed in terms of standard compliance, formal semantics, tuple type, vocabulary term usage, blank nodes, provenance granularity, and scalability. This can be used to advance existing solutions and help implementers to select the most suitable approach (or a combination of approaches) for their applications. Moreover, the analysis of the mechanisms and their limitations highlighted in this paper can serve as the basis for novel approaches in RDF-powered applications with increasing provenance needs