61 research outputs found
Provenance, Incremental Evaluation, and Debugging in Datalog
The Datalog programming language has recently found increasing traction in research and industry. Driven by its clean declarative semantics, along with its conciseness and ease of use, Datalog has been adopted for a wide range of important applications, such as program analysis, graph problems, and networking. To enable this adoption, modern Datalog engines have implemented advanced language features and high-performance evaluation of Datalog programs. Unfortunately, critical infrastructure and tooling to support Datalog users and developers are still missing. For example, there are only limited tools addressing the crucial debugging problem, where developers can spend up to 30% of their time finding and fixing bugs.
This thesis addresses Datalog’s tooling gaps, with the ultimate goal of improving the productivity of Datalog programmers. The first contribution is centered around the critical problem of debugging: we develop a new debugging approach that explains the execution steps taken to produce a faulty output. Crucially, our debugging method can be applied to large-scale applications without substantially sacrificing performance. The second contribution addresses the problem of incremental evaluation, which is necessary when program inputs change slightly and results need to be recomputed. Incremental evaluation allows this recomputation to happen more efficiently, without discarding the previous results and recomputing from scratch. The final contribution provides a new incremental debugging approach that identifies the root causes of faulty outputs that occur after an incremental evaluation. Incremental debugging focuses on the relationship between input and output, and can provide debugging suggestions to amend the inputs so that faults no longer occur. These techniques, in combination, form a corpus of critical infrastructure and tooling developments for Datalog, allowing developers and users to use Datalog more productively.
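The explanation-based debugging idea can be sketched in miniature: a naive Datalog-style evaluator that records, for each derived fact, the premises that first produced it, so that a (possibly faulty) output tuple can be traced back through its derivation. This is only an illustrative toy under assumed rules for transitive closure (`path(x,y) :- edge(x,y)` and `path(x,y) :- path(x,z), edge(z,y)`), not the thesis's actual system; the relation names and the `why` map are hypothetical.

```python
# Toy Datalog-style evaluation with provenance, for the rules:
#   path(x,y) :- edge(x,y).
#   path(x,y) :- path(x,z), edge(z,y).
edges = {("a", "b"), ("b", "c"), ("c", "d")}

def solve_with_provenance(edges):
    # why[fact] records the premises that first derived the fact
    # (None marks an input fact, which needs no explanation).
    why = {("edge", x, y): None for (x, y) in edges}
    why.update({("path", x, y): [("edge", x, y)] for (x, y) in edges})
    changed = True
    while changed:
        changed = False
        for (_, x, z) in [p for p in why if p[0] == "path"]:
            for (ex, ey) in edges:
                if ex == z and ("path", x, ey) not in why:
                    why[("path", x, ey)] = [("path", x, z), ("edge", z, ey)]
                    changed = True
    return why

def explain(fact, why, depth=0):
    # Print the derivation tree of a derived fact: a debugging "proof".
    print("  " * depth + str(fact))
    for premise in (why.get(fact) or []):
        explain(premise, why, depth + 1)

why = solve_with_provenance(edges)
explain(("path", "a", "d"), why)
```

A real engine would record this provenance during high-performance evaluation rather than in a naive loop, but the walk from a suspect output back to the input facts that justify it is the same shape.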
Learning programs by learning from failures
We describe an inductive logic programming (ILP) approach called learning
from failures. In this approach, an ILP system (the learner) decomposes the
learning problem into three separate stages: generate, test, and constrain. In
the generate stage, the learner generates a hypothesis (a logic program) that
satisfies a set of hypothesis constraints (constraints on the syntactic form of
hypotheses). In the test stage, the learner tests the hypothesis against
training examples. A hypothesis fails when it does not entail all the positive
examples or entails a negative example. If a hypothesis fails, then, in the
constrain stage, the learner learns constraints from the failed hypothesis to
prune the hypothesis space, i.e. to constrain subsequent hypothesis generation.
For instance, if a hypothesis is too general (entails a negative example), the
constraints prune generalisations of the hypothesis. If a hypothesis is too
specific (does not entail all the positive examples), the constraints prune
specialisations of the hypothesis. This loop repeats until either (i) the
learner finds a hypothesis that entails all the positive and none of the
negative examples, or (ii) there are no more hypotheses to test. We introduce
Popper, an ILP system that implements this approach by combining answer set
programming and Prolog. Popper supports infinite problem domains, reasoning
about lists and numbers, learning textually minimal programs, and learning
recursive programs. Our experimental results on three domains (toy game
problems, robot strategies, and list transformations) show that (i) constraints
drastically improve learning performance, and (ii) Popper can outperform
existing ILP systems, both in terms of predictive accuracies and learning
times.
Comment: Accepted for the machine learning journal
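The generate, test, and constrain loop described above can be illustrated on a deliberately tiny hypothesis space. In this sketch, hypotheses are integer intervals `[lo, hi]` ordered by inclusion (a larger interval is more general), standing in for logic programs; the pruning directions match the abstract, but nothing here is Popper's actual ASP/Prolog machinery.

```python
# Toy learning-from-failures loop (illustration only, not Popper itself).
# A hypothesis (lo, hi) "entails" an example x iff lo <= x <= hi.
from itertools import product

pos, neg = [3, 5], [8]
candidates = [(lo, hi) for lo, hi in product(range(10), repeat=2) if lo <= hi]
pruned = set()

def entails(h, x):
    return h[0] <= x <= h[1]

def generalisations(h):   # intervals containing h
    return {c for c in candidates if c[0] <= h[0] and c[1] >= h[1]}

def specialisations(h):   # intervals contained in h
    return {c for c in candidates if c[0] >= h[0] and c[1] <= h[1]}

solution = None
for h in candidates:                                   # generate
    if h in pruned:
        continue
    too_general = any(entails(h, x) for x in neg)      # test
    too_specific = not all(entails(h, x) for x in pos)
    if too_general:                                    # constrain
        pruned |= generalisations(h)
    if too_specific:
        pruned |= specialisations(h)
    if not too_general and not too_specific:
        solution = h
        break

print(solution)  # → (0, 5)
```

Each failure eliminates a whole cone of the hypothesis space rather than a single candidate, which is why the abstract reports that constraints drastically improve learning performance.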
Semiring Provenance for Büchi Games: Strategy Analysis with Absorptive Polynomials
This paper presents a case study for the application of semiring semantics
for fixed-point formulae to the analysis of strategies in Büchi games.
Semiring semantics generalizes the classical Boolean semantics by permitting
multiple truth values from certain semirings. Evaluating the fixed-point
formula that defines the winning region in a given game in an appropriate
semiring of polynomials provides not only the Boolean information on who wins,
but also tells us how they win and which strategies they might use. This is
well-understood for reachability games, where the winning region is definable
as a least fixed point. The case of Büchi games is of special interest, not
only due to their practical importance, but also because it is the simplest
case where the fixed-point definition involves a genuine alternation of a
greatest and a least fixed point.
We show that, in a precise sense, semiring semantics provides information
about all absorption-dominant strategies (strategies that win with minimal
effort), and we discuss how these relate to positional and the more general
persistent strategies. This information enables further applications such as
game synthesis or determining minimal modifications to the game needed to
change its outcome. Lastly, we discuss limitations of our approach and present
questions that cannot be immediately answered by semiring semantics.
Comment: Full version of a paper submitted to GandALF 202
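The key move in semiring semantics, interpreting disjunction as semiring addition and conjunction as multiplication over polynomials with tracked variables, can be shown in a few lines. The encoding below (monomials as frozensets of variables, so exponents are dropped, an idempotent/absorptive reading) is an assumed simplification for illustration, not the paper's full fixed-point machinery.

```python
# Provenance polynomials as {monomial: coefficient}, where a monomial is a
# frozenset of tracked variables. Addition models disjunction (alternative
# ways to win), multiplication models conjunction (joint requirements).

def padd(p, q):
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + c
    return r

def pmul(p, q):
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = m1 | m2            # exponent-dropping product of monomials
            r[m] = r.get(m, 0) + c1 * c2
    return r

def var(x):
    return {frozenset([x]): 1}

def absorb(p):
    # Absorption: a monomial absorbs every strict superset of itself,
    # so only the "minimal effort" witnesses survive.
    ms = list(p)
    return {m: p[m] for m in ms if not any(m2 < m for m2 in ms)}

# Evaluating x ∧ (y ∨ z) yields two witnesses, {x, y} and {x, z}:
phi = pmul(var("x"), padd(var("y"), var("z")))
print(phi)
```

Instead of the bare Boolean answer "true", the polynomial records which atomic facts (here `x`, `y`, `z`) each way of winning depends on; the `absorb` step is what singles out the absorption-dominant witnesses.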
Incremental Processing and Optimization of Update Streams
Over the recent years, we have seen an increasing number of applications in networking, sensor networks, cloud computing, and environmental monitoring, which monitor, plan, control, and make decisions over data streams from multiple sources. We are interested in extending traditional stream processing techniques to meet the new challenges of these applications. Generally, in order to support genuine continuous query optimization and processing over data streams, we need to systematically understand how to address incremental optimization and processing of update streams for a rich class of queries commonly used in the applications.
Our general thesis is that efficient incremental processing and re-optimization of update streams can be achieved with incremental view maintenance techniques, by casting these problems as incremental view maintenance problems over data streams. We focus on two challenges in incremental processing of update streams that are not addressed by existing work on stream query processing: incremental processing of transitive closure queries over data streams, and incremental re-optimization of queries. Beyond addressing these specific challenges, we also develop Aspen, a working end-to-end stream processing prototype that has been deployed as the foundation for a case study of our SmartCIS application. We validate our solutions both analytically and empirically on top of Aspen, over a variety of benchmark workloads such as the TPC-H and Linear Road benchmarks.
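The contrast between recomputing from scratch and maintaining a view incrementally can be sketched for the transitive closure case the abstract highlights. This is a generic delta-propagation sketch under assumed names (`close`, `insert_edge`), not Aspen's algorithm: an edge insertion touches only the new reachability pairs it creates.

```python
# From-scratch transitive closure (the baseline being avoided).
def close(edges):
    tc = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(tc):
            for (y2, z) in list(tc):
                if y == y2 and (x, z) not in tc:
                    tc.add((x, z))
                    changed = True
    return tc

# Incremental maintenance: propagate only the delta caused by one new edge.
def insert_edge(edges, tc, e):
    edges.add(e)
    delta = {e}
    while delta:
        new = set()
        for (x, y) in delta:
            if (x, y) not in tc:
                tc.add((x, y))
                # extend forward through y and backward through x
                new |= {(x, z) for (y2, z) in tc if y2 == y}
                new |= {(w, y) for (w, x2) in tc if x2 == x}
        delta = new
    return tc

edges = {("a", "b"), ("c", "d")}
tc = close(edges)
insert_edge(edges, tc, ("b", "c"))   # one stream update; no full recompute
```

On a large graph the delta set is typically tiny relative to the closure, which is the efficiency argument for casting stream updates as view maintenance.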
Generalisation Through Negation and Predicate Invention
The ability to generalise from a small number of examples is a fundamental
challenge in machine learning. To tackle this challenge, we introduce an
inductive logic programming (ILP) approach that combines negation and predicate
invention. Combining these two features allows an ILP system to generalise
better by learning rules with universally quantified body-only variables. We
implement our idea in NOPI, which can learn normal logic programs with
predicate invention, including Datalog programs with stratified negation. Our
experimental results on multiple domains show that our approach can improve
predictive accuracies and learning times.
Comment: Under peer review
Saggitarius: A DSL for Specifying Grammatical Domains
Common data types like dates, addresses, phone numbers and tables can have
multiple textual representations, and many heavily-used languages, such as SQL,
come in several dialects. These variations can cause data to be misinterpreted,
leading to silent data corruption, failure of data processing systems, or even
security vulnerabilities. Saggitarius is a new language and system designed to
help programmers reason about the format of data, by describing grammatical
domains -- that is, sets of context-free grammars that describe the many
possible representations of a datatype. We describe the design of Saggitarius
via example and provide a relational semantics. We show how Saggitarius may be
used to analyze a data set: given example data, it uses an algorithm based on
semiring parsing and MaxSAT to infer which grammar in a given domain best
matches that data. We evaluate the effectiveness of the algorithm on a
benchmark suite of 110 example problems, and we demonstrate that our system
typically returns a satisfying grammar within a few seconds with only a small
number of examples. We also delve deeper into a more extensive case study on
using Saggitarius for CSV dialect detection. Despite being general-purpose, we
find that Saggitarius offers comparable results to hand-tuned, specialized
tools; in the case of CSV, it infers grammars for 84% of benchmarks within 60
seconds, and has comparable accuracy to custom-built dialect detection tools.
Comment: OOPSLA 202
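The "pick the best grammar in a domain from example data" idea can be caricatured for the CSV case: treat each candidate delimiter as a grammar in the domain and score it by how consistently it parses the examples. This toy heuristic is an assumed stand-in for illustration; Saggitarius's actual inference uses semiring parsing and MaxSAT over full context-free grammars.

```python
# Toy grammar-domain selection for CSV dialects: each candidate delimiter
# plays the role of one grammar in the domain (illustration only).
CANDIDATE_DELIMS = [",", ";", "\t", "|"]

def score(delim, rows):
    counts = [len(r.split(delim)) for r in rows]
    consistent = min(counts) > 1 and counts.count(counts[0]) == len(counts)
    # prefer dialects that split every example into the same number (>1)
    # of fields; break ties toward more fields
    return (consistent, counts[0])

def infer_delimiter(rows):
    return max(CANDIDATE_DELIMS, key=lambda d: score(d, rows))

print(infer_delimiter(["a;b;c", "1;2;3"]))  # → ;
```

Where this toy ranks candidates with an ad-hoc tuple, Saggitarius ranks whole grammars in a user-specified grammatical domain, which is what lets the same system cover dates, addresses, and SQL dialects as well as CSV.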