Validity constraints for data analysis workflows
Porting a scientific data analysis workflow (DAW) to a cluster infrastructure, a new software stack, or even just a new dataset with notably different properties is often challenging. Although the DAW specification structures the steps (tasks) of a complex data analysis and their interdependencies, relevant assumptions may remain unspecified and implicit. Such hidden assumptions often lead to tasks crashing without a reasonable error message, generally poor performance, non-terminating executions, or silently wrong results of the DAW, to name only a few possible consequences. Searching for the causes of such errors and drawbacks in a distributed compute cluster managed by a complex infrastructure stack, where DAWs for large datasets are typically executed, can be tedious and time-consuming. We propose validity constraints (VCs) as a new concept for DAW languages to alleviate this situation. A VC specifies logical conditions that must be fulfilled at certain times for a DAW execution to be valid. When defined together with a DAW, VCs help to improve the portability, adaptability, and reusability of DAWs by making implicit assumptions explicit. Once specified, VCs can be checked automatically by the DAW infrastructure, and violations can lead to meaningful error messages and graceful behavior (e.g., termination or invocation of repair mechanisms). We provide a broad list of possible VCs, classify them along multiple dimensions, and compare them to similar concepts found in related fields. We also provide a proof-of-concept implementation for the workflow system Nextflow.
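To make the idea concrete, the following is a minimal sketch of a validity constraint as an explicit pre/post-condition attached to a task. All names here (`validity_constraint`, `normalize`) are illustrative assumptions, not the paper's Nextflow-based implementation; the point is only that a violated constraint yields a meaningful error instead of a silent wrong result.

```python
# Sketch: a validity constraint (VC) as a checked pre/post-condition on a task.
# A violation raises a descriptive error instead of letting the task crash
# obscurely or produce silently wrong output.

def validity_constraint(pre=None, post=None):
    """Attach explicit validity checks to a task function."""
    def decorate(task):
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise ValueError(f"VC violated before task '{task.__name__}'")
            result = task(*args, **kwargs)
            if post is not None and not post(result):
                raise ValueError(f"VC violated after task '{task.__name__}'")
            return result
        return wrapper
    return decorate

@validity_constraint(
    pre=lambda records: len(records) > 0,          # implicit assumption made explicit
    post=lambda out: all(v >= 0 for v in out),     # output sanity check
)
def normalize(records):
    total = sum(records)
    return [r / total for r in records]
```

Calling `normalize([])` now fails fast with a clear VC-violation message rather than an opaque `ZeroDivisionError` deep inside a cluster run.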
Optimizing Query Predicates with Disjunctions for Column Stores
Since its inception, database research has given limited attention to
optimizing predicates with disjunctions. What little past work there is has
focused on optimizations for traditional row-oriented databases. A key
difference in predicate evaluation for row stores and column stores is that
while row stores apply predicates to one record at a time, column stores apply
predicates to sets of records. Not only must the execution engine decide the
order in which to apply the predicates, but it must also decide how many times
each predicate should be applied and to which sets of records. In our work, we
tackle exactly this problem. We formulate, analyze,
and solve the predicate evaluation problem for column stores. Our results
include proofs about various properties of the problem, and in turn, these
properties have allowed us to derive the first polynomial-time (i.e., O(n log
n)) algorithm ShallowFish which evaluates predicates optimally for all
predicate expressions with a depth of 2 or less. We capture the exact property
which makes the problem more difficult for predicate expressions of depth 3 or
greater and propose an approximate algorithm DeepFish which outperforms
ShallowFish in these situations. Finally, we show that both ShallowFish and
DeepFish outperform the corresponding state of the art by two orders of
magnitude.
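The set-at-a-time evaluation the abstract describes can be illustrated with a small sketch. This is not ShallowFish or DeepFish; it is an assumed toy that applies a conjunction of predicates over columns using a shrinking selection vector, ordering predicates by an estimated selectivity so that later predicates touch fewer records.

```python
# Illustrative column-store predicate evaluation (not the paper's algorithm):
# predicates are applied to sets of records via a selection vector of
# surviving row indices, most selective predicate first.

def evaluate_conjunction(columns, predicates, estimated_selectivity):
    """columns: dict col_name -> list of values (all equal length).
    predicates: list of (col_name, unary_predicate).
    estimated_selectivity: dict col_name -> fraction of rows expected to pass."""
    n = len(next(iter(columns.values())))
    selection = list(range(n))  # indices of records still passing
    # Apply predicates in ascending order of estimated selectivity,
    # so each later predicate sees a smaller set of records.
    for col, fn in sorted(predicates, key=lambda p: estimated_selectivity[p[0]]):
        selection = [i for i in selection if fn(columns[col][i])]
    return selection
```

Note that this toy fixes one application per predicate; the problem the paper solves is harder precisely because, with disjunctions, a predicate may be worth applying several times to different record sets.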
Generating optimal plans for Boolean expressions
We present an algorithm that produces optimal plans to evaluate arbitrary Boolean expressions possibly containing conjunctions and disjunctions. The complexity of our algorithm is O(n · 3^n), where n is the number of simple predicates in the Boolean expression. This complexity is far lower than that of Reinwald and Soland's algorithm (O(2^(2^n))). This lower complexity allows us to optimize Boolean expressions with up to 16 predicates in a reasonable time. Further, as opposed to many existing approaches, our algorithm fulfills all requirements necessary in the context of main-memory database systems. We then use this algorithm to (1) determine the optimization potential inherent in Boolean expressions and (2) evaluate the plan quality of two heuristics proposed in the literature.
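The cost model behind such plan optimization can be shown on a much simpler special case. The sketch below is an assumption-laden toy, not the O(n · 3^n) algorithm: it computes the expected short-circuit cost of evaluating a single conjunction in a fixed predicate order, then brute-forces all n! orders to find the cheapest plan. The paper's algorithm instead handles arbitrary and/or nestings over a subset-based search space.

```python
# Toy plan optimization for a conjunction p1 AND p2 AND ... (short-circuit):
# predicate p runs only if every earlier predicate in the order passed.
from itertools import permutations

def expected_cost(order, cost, selectivity):
    """Expected evaluation cost of a fixed left-to-right order."""
    total, prob_reached = 0.0, 1.0
    for p in order:
        total += prob_reached * cost[p]   # pay cost[p] if we reach p
        prob_reached *= selectivity[p]    # continue only if p passes
    return total

def best_plan(preds, cost, selectivity):
    """Brute-force the cheapest order; exponential, for illustration only."""
    return min(permutations(preds), key=lambda o: expected_cost(o, cost, selectivity))
```

For equal per-predicate costs the cheapest plan puts the most selective predicate first, which is exactly the intuition that stops generalizing cleanly once disjunctions enter the expression.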