315 research outputs found
A survey of parallel execution strategies for transitive closure and logic programs
An important feature of database technology of the nineties is the use of parallelism for speeding up the execution of complex queries. This technology is being tested in several experimental database architectures and a few commercial systems for conventional select-project-join queries. In particular, hash-based fragmentation is used to distribute data to disks under the control of different processors in order to perform selections and joins in parallel. With the development of new query languages, and in particular with the definition of transitive closure queries and of more general logic programming queries, the new dimension of recursion has been added to query processing. Recursive queries are complex; at the same time, their regular structure is particularly suited for parallel execution, and parallelism may give a high efficiency gain. We survey the approaches to parallel execution of recursive queries that have been presented in the recent literature. We observe that research on parallel execution of recursive queries is separated into two distinct subareas, one focused on the transitive closure of Relational Algebra expressions, the other one focused on optimization of more general Datalog queries. Though the subareas seem radically different because of the approach and formalism used, they have many common features. This is not surprising, because most typical Datalog queries can be solved by means of the transitive closure of simple algebraic expressions. We first analyze the relationship between the transitive closure of expressions in Relational Algebra and Datalog programs. We then review sequential methods for evaluating transitive closure, distinguishing iterative and direct methods. We address the parallelization of these methods, by discussing various forms of parallelization. Data fragmentation plays an important role in obtaining parallel execution; we describe hash-based and semantic fragmentation. Finally, we consider Datalog queries, and present general methods for parallel rule execution; we recognize the similarities between these methods and the methods reviewed previously, when the former are applied to linear Datalog queries. We also provide a quantitative analysis that shows the impact of the initial data distribution on the performance of methods
A Static Analyzer for Large Safety-Critical Software
We show that abstract interpretation-based static program analysis can be
made efficient and precise enough to formally verify a class of properties for
a family of large programs with few or no false alarms. This is achieved by
refinement of a general purpose static analyzer and later adaptation to
particular programs of the family by the end-user through parametrization. This
is applied to the proof of soundness of data manipulation operations at the
machine level for periodic synchronous safety critical embedded software. The
main novelties are the design principle of static analyzers by refinement and
adaptation through parametrization, the symbolic manipulation of expressions to
improve the precision of abstract transfer functions, the octagon, ellipsoid,
and decision tree abstract domains, all with sound handling of rounding errors
in floating point computations, widening strategies (with thresholds, delayed)
and the automatic determination of the parameters (parametrized packing)
Finding Cross-rule Optimization Bugs in Datalog Engines
Datalog is a popular and widely-used declarative logic programming language.
Datalog engines apply many cross-rule optimizations; bugs in them can cause
incorrect results. To detect such optimization bugs, we propose an automated
testing approach called Incremental Rule Evaluation (IRE), which
synergistically tackles the test oracle and test case generation problem. The
core idea behind the test oracle is to compare the results of an optimized
program and a program without cross-rule optimization; any difference indicates
a bug in the Datalog engine. Our core insight is that, for an optimized,
incrementally-generated Datalog program, we can evaluate all rules individually
by constructing a reference program to disable the optimizations that are
performed among multiple rules. Incrementally generating test cases not only
allows us to apply the test oracle for every new rule generated-we also can
ensure that every newly added rule generates a non-empty result with a given
probability and eschew recomputing already-known facts. We implemented IRE as a
tool named Deopt, and evaluated Deopt on four mature Datalog engines, namely
Souffl\'e, CozoDB, Z, and DDlog, and discovered a total of 30 bugs. Of
these, 13 were logic bugs, while the remaining were crash and error bugs. Deopt
can detect all bugs found by queryFuzz, a state-of-the-art approach. Out of the
bugs identified by Deopt, queryFuzz might be unable to detect 5. Our
incremental test case generation approach is efficient; for example, for test
cases containing 60 rules, our incremental approach can produce 1.17
(for DDlog) to 31.02 (for Souffl\'e) as many valid test cases with
non-empty results as the naive random method. We believe that the simplicity
and the generality of the approach will lead to its wide adoption in practice.Comment: The ACM SIGPLAN Conference on Object Oriented Programming, Systems,
Languages, and Applications (2024), Pasadena, California, United State
Spinning Fast Iterative Data Flows
Parallel dataflow systems are a central part of most analytic pipelines for
big data. The iterative nature of many analysis and machine learning
algorithms, however, is still a challenge for current systems. While certain
types of bulk iterative algorithms are supported by novel dataflow frameworks,
these systems cannot exploit computational dependencies present in many
algorithms, such as graph algorithms. As a result, these algorithms are
inefficiently executed and have led to specialized systems based on other
paradigms, such as message passing or shared memory. We propose a method to
integrate incremental iterations, a form of workset iterations, with parallel
dataflows. After showing how to integrate bulk iterations into a dataflow
system and its optimizer, we present an extension to the programming model for
incremental iterations. The extension alleviates for the lack of mutable state
in dataflows and allows for exploiting the sparse computational dependencies
inherent in many iterative algorithms. The evaluation of a prototypical
implementation shows that those aspects lead to up to two orders of magnitude
speedup in algorithm runtime, when exploited. In our experiments, the improved
dataflow system is highly competitive with specialized systems while
maintaining a transparent and unified dataflow abstraction.Comment: VLDB201
Asynchronous Distributed Execution of Fixpoint-Based Computational Fields
Coordination is essential for dynamic distributed systems whose components exhibit interactive and autonomous behaviors. Spatially distributed, locally interacting, propagating computational fields are particularly appealing for allowing components to join and leave with little or no overhead. Computational fields are a key ingredient of aggregate programming, a promising software engineering methodology particularly relevant for the Internet of Things. In our approach, space topology is represented by a fixed graph-shaped field, namely a network with attributes on both nodes and arcs, where arcs represent interaction capabilities between nodes. We propose a SMuC calculus where mu-calculus- like modal formulas represent how the values stored in neighbor nodes should be combined to update the present node. Fixpoint operations can be understood globally as recursive definitions, or locally as asynchronous converging propagation processes. We present a distributed implementation of our calculus. The translation is first done mapping SMuC programs into normal form, purely iterative programs and then into distributed programs. Some key results are presented that show convergence of fixpoint computations under fair asynchrony and under reinitialization of nodes. The first result allows nodes to proceed at different speeds, while the second one provides robustness against certain kinds of failure. We illustrate our approach with a case study based on a disaster recovery scenario, implemented in a prototype simulator that we use to evaluate the performance of a recovery strategy
- …