Provenance, Incremental Evaluation, and Debugging in Datalog
The Datalog programming language has recently found increasing traction in research and industry. Driven by its clean declarative semantics, along with its conciseness and ease of use, Datalog has been adopted for a wide range of important applications, such as program analysis, graph problems, and networking. To enable this adoption, modern Datalog engines have implemented advanced language features and high-performance evaluation of Datalog programs. Unfortunately, critical infrastructure and tooling to support Datalog users and developers are still missing. For example, there are only limited tools addressing the crucial debugging problem, where developers can spend up to 30% of their time finding and fixing bugs.
This thesis addresses Datalog’s tooling gaps, with the ultimate goal of improving the productivity of Datalog programmers. The first contribution is centered around the critical problem of debugging: we develop a new debugging approach that explains the execution steps taken to produce a faulty output. Crucially, our debugging method can be applied to large-scale applications without substantially sacrificing performance. The second contribution addresses the problem of incremental evaluation, which is necessary when program inputs change slightly and results need to be recomputed. Incremental evaluation allows this recomputation to happen more efficiently, without discarding the previous results and recomputing from scratch. Finally, the last contribution provides a new incremental debugging approach that identifies the root causes of faulty outputs that occur after an incremental evaluation. Incremental debugging focuses on the relationship between input and output and can suggest amendments to the inputs so that faults no longer occur. In combination, these techniques form a corpus of critical infrastructure and tooling developments for Datalog, allowing developers and users to use Datalog more productively.
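
To make the idea of "explaining the execution steps" concrete, the following is a minimal sketch in Python, not the thesis's implementation. It assumes a hypothetical two-rule program, `path(x,y) :- edge(x,y).` and `path(x,z) :- path(x,y), edge(y,z).`, evaluates it semi-naively, and records the first derivation found for each tuple so that a proof tree can be printed afterwards. The function names `evaluate_path` and `explain` are illustrative only.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]
# For each derived path tuple: (rule name, premises), where each premise
# is a (relation name, tuple) pair.
Proof = Dict[Edge, Tuple[str, tuple]]


def evaluate_path(edges: Set[Edge]) -> Proof:
    """Semi-naive evaluation of the assumed two-rule path program.

    Only the first derivation of each tuple is stored, which is enough
    to print one explanation per output.
    """
    proof: Proof = {}
    delta: Set[Edge] = set()
    for e in sorted(edges):
        proof[e] = ("base", (("edge", e),))
        delta.add(e)
    while delta:
        new: Set[Edge] = set()
        # Join only the tuples discovered in the previous round with edge.
        for (x, y) in sorted(delta):
            for (y2, z) in sorted(edges):
                if y == y2 and (x, z) not in proof:
                    proof[(x, z)] = ("step", (("path", (x, y)), ("edge", (y, z))))
                    new.add((x, z))
        delta = new
    return proof


def explain(proof: Proof, fact: Edge, depth: int = 0) -> List[str]:
    """Render the recorded proof tree of a path tuple as indented lines."""
    rule, premises = proof[fact]
    lines = ["  " * depth + f"path{fact} by rule '{rule}'"]
    for rel, tup in premises:
        if rel == "path":
            lines.extend(explain(proof, tup, depth + 1))
        else:
            lines.append("  " * (depth + 1) + f"edge{tup} (input)")
    return lines


example_edges = {("a", "b"), ("b", "c"), ("c", "d")}
example_proof = evaluate_path(example_edges)
print("\n".join(explain(example_proof, ("a", "d"))))
```

If `path('a', 'd')` were an unexpected output, the printed tree would show which input edges contributed to it. Storing one derivation per tuple keeps the overhead small in this sketch; how the thesis achieves low overhead at scale is not stated in the abstract.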
Semiring Provenance for Lightweight Description Logics
We investigate semiring provenance--a successful framework originally defined
in the relational database setting--for description logics. In this context,
the ontology axioms are annotated with elements of a commutative semiring and
these annotations are propagated to the ontology consequences in a way that
reflects how they are derived. We define a provenance semantics for a language
that encompasses several lightweight description logics and show its
relationships with semantics that have been defined for ontologies annotated
with a specific kind of annotation (such as fuzzy degrees). We show that under
some restrictions on the semiring, the semantics satisfies desirable properties
(such as extending the semiring provenance defined for databases). We then
focus on the well-known why-provenance, which allows computing the semiring
provenance for every additively and multiplicatively idempotent commutative
semiring, and for which we study the complexity of problems related to the
provenance of an axiom or a conjunctive query answer. Finally, we consider two
more restricted cases which correspond to the so-called positive Boolean
provenance and lineage in the database setting. For these cases, we exhibit
relationships with well-known notions related to explanations in description
logics and complete our complexity analysis. As a side contribution, we provide
conditions on an ELHI_bot ontology that guarantee tractable reasoning.
Comment: Paper currently under review. 102 pages.
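
As an illustration of how annotations propagate to consequences, here is a small Python sketch restricted to a drastically simplified setting: atomic concept inclusions closed under transitivity only. This is an assumption made for brevity and is far narrower than the language studied in the paper. Why-provenance is represented as a set of sets of axiom labels, with union as addition and pairwise union as multiplication. The labels `x1`, `x2`, `x3` and all function names are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

Why = FrozenSet[FrozenSet[str]]
Inclusion = Tuple[str, str]  # (A, B) stands for the axiom "A is subsumed by B"


def why_plus(a: Why, b: Why) -> Why:
    """Addition: alternative derivations are collected side by side."""
    return a | b


def why_times(a: Why, b: Why) -> Why:
    """Multiplication: a joint derivation uses the axioms of both parts."""
    return frozenset(x | y for x, y in product(a, b))


def subsumption_provenance(axioms: Dict[Inclusion, str]) -> Dict[Inclusion, Why]:
    """Annotate every inclusion derivable by transitivity with its why-provenance.

    The loop runs to a fixpoint; it terminates because both operations are
    idempotent and only finitely many sets of labels exist.
    """
    prov: Dict[Inclusion, Why] = {
        incl: frozenset({frozenset({label})}) for incl, label in axioms.items()
    }
    changed = True
    while changed:
        changed = False
        for (a, b), p1 in list(prov.items()):
            for (b2, c), p2 in list(prov.items()):
                if b != b2:
                    continue
                old = prov.get((a, c), frozenset())
                new = why_plus(old, why_times(p1, p2))
                if new != old:
                    prov[(a, c)] = new
                    changed = True
    return prov


example_axioms = {("A", "B"): "x1", ("B", "C"): "x2", ("A", "C"): "x3"}
example_prov = subsumption_provenance(example_axioms)
```

In this example the inclusion of `A` in `C` is both asserted directly (label `x3`) and derivable from `x1` and `x2` together, so its annotation contains the two sets `{x3}` and `{x1, x2}`.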
Answering Regular Path Queries on Workflow Provenance
This paper proposes a novel approach for efficiently evaluating regular path
queries over provenance graphs of workflows that may include recursion. The
approach assumes that an execution g of a workflow G is labeled with
query-agnostic reachability labels using an existing technique. At query time,
given g, G and a regular path query R, the approach decomposes R into a set of
subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is
rewritten so that, using the reachability labels of nodes in g, whether or not
there is a path which matches Ri between two nodes can be decided in constant
time. The results of each safe subquery are then composed, possibly with some
small unsafe remainder, to produce an answer to R. The approach results in an
algorithm that significantly reduces the number of subqueries k over existing
techniques by increasing their size and complexity, and that evaluates each
subquery in time bounded by its input and output size. Experimental results
demonstrate the benefit of this approach.
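
For context, the sketch below shows the straightforward baseline for regular path queries, not the paper's label-based technique: a breadth-first search over the product of the provenance graph and a deterministic automaton for the query. Its cost grows with the size of that product, whereas the abstract describes deciding each safe subquery between two nodes in constant time using reachability labels. The graph, the edge labels (`read`, `transform`, `write`), and the function name `rpq_reachable` are all hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

LabeledEdge = Tuple[str, str, str]  # (source node, edge label, target node)


def rpq_reachable(
    edges: Iterable[LabeledEdge],
    dfa: Dict[Tuple[int, str], int],
    start_state: int,
    accepting: Set[int],
    source: str,
) -> Set[str]:
    """Return the nodes reachable from `source` along a path whose label
    sequence is accepted by the given deterministic automaton."""
    adj: Dict[str, List[Tuple[str, str]]] = {}
    for u, label, v in edges:
        adj.setdefault(u, []).append((label, v))

    seen = {(source, start_state)}
    queue = deque(seen)
    answers: Set[str] = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for label, nxt in adj.get(node, []):
            nstate = dfa.get((state, label))
            if nstate is None:
                continue  # the automaton has no transition on this label
            pair = (nxt, nstate)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return answers


# Query: read transform* write, encoded as a three-state automaton.
example_dfa = {(0, "read"): 1, (1, "transform"): 1, (1, "write"): 2}
example_run = [
    ("in", "read", "t1"),
    ("t1", "transform", "t2"),
    ("t2", "transform", "t3"),
    ("t3", "write", "out"),
    ("t1", "write", "log"),
]
example_answers = rpq_reachable(example_run, example_dfa, 0, {2}, "in")
```

Here both `out` and `log` match the query from `in`: `log` via zero `transform` steps and `out` via two.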