Bi-Dimensional Binning for Big Genomic Datasets
Binning the genome is used to parallelize big data operations over regions. In this extended abstract, we comparatively evaluate the performance and scalability of Spark and SciDB implementations over datasets consisting of billions of genomic regions. In particular, we introduce an original method for binning the genome, i.e., partitioning it into sections of small size, and show that it outperforms the conventional binning used by SciDB and closes the gap between SciDB and a Spark-based implementation. The concept of bi-dimensional binning is new and can be extended to other systems and technologies.
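As a rough illustration of what binning means here, the sketch below shows conventional fixed-width binning only; it is not the bi-dimensional method of the abstract, whose details are not given. The function names, the `(chrom, start, end)` tuple layout, and the half-open coordinate convention are assumptions made for this example.

```python
from collections import defaultdict

def bins_for_region(start, end, bin_size):
    """Return the indices of all fixed-width bins touched by the
    half-open region [start, end). Hypothetical helper."""
    return range(start // bin_size, (end - 1) // bin_size + 1)

def bin_regions(regions, bin_size):
    """Group regions by (chromosome, bin index).

    A region spanning several bins is replicated into each of them,
    so every bin can be processed independently (and in parallel).
    """
    bins = defaultdict(list)
    for chrom, start, end in regions:
        for b in bins_for_region(start, end, bin_size):
            bins[(chrom, b)].append((chrom, start, end))
    return bins

regions = [("chr1", 950, 2100), ("chr1", 100, 200), ("chr2", 0, 1000)]
binned = bin_regions(regions, bin_size=1000)
```

With a bin size of 1000, the first region is replicated into bins 0, 1, and 2 of `chr1`, which is the kind of redundancy a binning strategy has to trade off against parallelism.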
Generalized Lineage-Aware Temporal Windows: Supporting Outer and Anti Joins in Temporal-Probabilistic Databases
The result of a temporal-probabilistic (TP) join with negation includes, at
each time point, the probability with which a tuple of a positive relation
matches none of the tuples in a negative relation, for a given join
condition. TP outer and anti joins thus resemble the characteristics of
relational outer and anti joins also in the case when there exist time points
at which input tuples from the positive relation have non-zero probabilities
to be true and input tuples from the negative relation have non-zero
probabilities to be false, respectively. For the computation of TP joins with
negation, we introduce generalized lineage-aware temporal windows, a mechanism
that binds an output interval to the lineages of all the matching valid tuples
of each input relation. We group the windows of two TP relations into three
disjoint sets based on the way attributes, lineage expressions and intervals
are produced. We compute all windows in an incremental manner, and we show that
pipelined computations allow for the direct integration of our approach into
PostgreSQL. We thereby alleviate the prevalent redundancies in the interval
computations of existing approaches, which is proven by an extensive
experimental evaluation with real-world datasets.
Forward Scan based Plane Sweep Algorithm for Parallel Interval Joins
The interval join is a basic operation that finds application in temporal, spatial, and uncertain databases. Although a number of centralized and distributed algorithms have been proposed for the efficient evaluation of interval joins, classic plane sweep approaches have not been considered at their full potential. A recent piece of related work proposes an optimized approach based on plane sweep (PS) for modern hardware, showing that it greatly outperforms previous work. However, this approach depends on the development of a complex data structure, and its parallelization has not been adequately studied. In this paper, we explore the applicability of a largely ignored forward scan (FS) based plane sweep algorithm, which is extremely simple to implement. We propose two optimizations of FS that greatly reduce its cost, making it competitive with the state-of-the-art single-threaded PS algorithm while achieving a lower memory footprint. In addition, we show the drawbacks of a previously proposed hash-based partitioning approach for parallel join processing and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition join jobs into a small number of independent mini-join jobs with varying cost and manage to avoid redundant comparisons. Finally, we show how these mini-joins can be scheduled on multiple CPU cores and propose an adaptive domain partitioning, aiming at load balancing. We include an experimental study that demonstrates the efficiency of our optimized FS and the scalability of our parallelization framework.
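For orientation, a minimal sketch of the basic forward scan plane sweep follows. It covers only the unoptimized, single-threaded algorithm; the two optimizations and the parallel partitioning mentioned in the abstract are not reproduced. Closed intervals given as `(start, end)` tuples and the function name are assumptions of this example.

```python
def forward_scan_join(R, S):
    """Basic forward-scan plane sweep over closed intervals (start, end).

    Both inputs are sorted by start. The interval with the smaller start
    is taken as the sweep position, and the other input is scanned
    forward for as long as its starts do not pass the current end.
    """
    R, S = sorted(R), sorted(S)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        r, s = R[i], S[j]
        if r[0] <= s[0]:
            # Every S[k] from j on starts at or after r starts, so it
            # overlaps r exactly when it starts no later than r ends.
            k = j
            while k < len(S) and S[k][0] <= r[1]:
                out.append((r, S[k]))
                k += 1
            i += 1
        else:
            # Symmetric case: scan R forward for partners of s.
            k = i
            while k < len(R) and R[k][0] <= s[1]:
                out.append((R[k], s))
                k += 1
            j += 1
    return out

R = [(1, 5), (6, 8), (12, 15)]
S = [(4, 7), (9, 10), (14, 20)]
pairs = forward_scan_join(R, S)
```

No auxiliary structure for active intervals is kept; the only state is the two cursors, which is what makes the approach simple to implement.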
Leveraging range joins for the computation of overlap joins
Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general-purpose joins than overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state of the art and a DBMS-embedded solution that is able to exploit standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals.
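The idea of expressing an overlap join as a union of range joins can be sketched as follows. This is an illustrative decomposition for closed intervals, not necessarily the exact formulation of the paper: two intervals overlap either when the start of s falls inside r, or when the start of r falls strictly after the start of s and inside s. A sorted list with `bisect` stands in for the B+-tree; all names are assumptions of this example.

```python
from bisect import bisect_left, bisect_right

def range_join(probe, indexed, strict_lower):
    """For each probe interval p, find indexed intervals whose start lies
    in [p.start, p.end], or in (p.start, p.end] if strict_lower is set.
    The sorted list of starts plays the role of an index on start."""
    indexed = sorted(indexed)
    starts = [iv[0] for iv in indexed]
    out = []
    for p in probe:
        lo = bisect_right(starts, p[0]) if strict_lower else bisect_left(starts, p[0])
        hi = bisect_right(starts, p[1])
        out.extend((p, indexed[k]) for k in range(lo, hi))
    return out

def overlap_join(R, S):
    """Overlap join over closed intervals as a union of two range joins."""
    # Case 1: r.start <= s.start <= r.end
    part1 = range_join(R, S, strict_lower=False)
    # Case 2: s.start < r.start <= s.end (strict, so no pair is reported twice)
    part2 = [(r, s) for s, r in range_join(S, R, strict_lower=True)]
    return part1 + part2

R = [(1, 5), (6, 8), (3, 3)]
S = [(4, 7), (2, 10), (9, 9)]
result = overlap_join(R, S)
```

Because the two cases are mutually exclusive, the union needs no duplicate elimination, and each case is an ordinary range predicate on a single attribute.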
Snapshot Semantics for Temporal Multiset Relations (Extended Version)
Snapshot semantics is widely used for evaluating queries over temporal data:
temporal relations are seen as sequences of snapshot relations, and queries are
evaluated at each snapshot. In this work, we demonstrate that current
approaches for snapshot semantics over interval-timestamped multiset relations
are subject to two bugs regarding snapshot aggregation and bag difference. We
introduce a novel temporal data model based on K-relations that overcomes these
bugs and prove it to correctly encode snapshot semantics. Furthermore, we
present an efficient implementation of our model as a database middleware and
demonstrate experimentally that our approach is competitive with native
implementations and significantly outperforms such implementations on queries
that involve aggregation.
Comment: extended version of PVLDB paper
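To make the notion of snapshot semantics concrete, here is a naive sketch that evaluates a query point by point over the snapshots of an interval-timestamped multiset relation. It illustrates only the definition, not the K-relation model or the middleware of the paper. The `(value, start, end)` layout with half-open intervals and all function names are assumptions of this example.

```python
from collections import Counter

def timeslice(relation, t):
    """Snapshot at time t: the multiset of values whose half-open
    interval [start, end) contains t."""
    return Counter(v for v, start, end in relation if start <= t < end)

def snapshot_query(relation, query, time_points):
    """Evaluate a non-temporal query independently at each snapshot."""
    return {t: query(timeslice(relation, t)) for t in time_points}

def count_all(snapshot):
    """COUNT(*) over a multiset snapshot; duplicates are counted."""
    return sum(snapshot.values())

rel = [("a", 1, 4), ("a", 2, 5), ("b", 3, 6)]
counts = snapshot_query(rel, count_all, range(0, 7))
```

Note that the count is defined, and equal to zero, at time points where no tuple is valid, and that the two copies of `"a"` are both counted where their intervals coincide.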
Query Results over Ongoing Databases that Remain Valid as Time Passes By (Extended Version)
The ongoing time point now is used to state that a tuple is valid from the start
point onward. For database systems ongoing time points have far-reaching
implications since they change continuously as time passes by. State-of-the-art
approaches deal with ongoing time points by instantiating them to the reference
time. The instantiation yields query results that are only valid at the chosen
time and get invalidated as time passes by. We propose a solution that keeps
ongoing time points uninstantiated during query processing. We do so by
evaluating predicates and functions at all possible reference times. This
renders query results independent of a specific reference time and yields
results that remain valid as time passes by. As query results, we propose
ongoing relations that include a reference time attribute. The value of the
reference time attribute is restricted by predicates and functions on ongoing
attributes. We describe and evaluate an efficient implementation of ongoing
data types and operations in PostgreSQL.
Comment: extended version of ICDE paper
Lineage-Aware Temporal Windows: Supporting Set Operations in Temporal-Probabilistic Databases
In temporal-probabilistic (TP) databases, the combination of the temporal and
the probabilistic dimension adds significant overhead to the computation of set
operations. Although set queries are guaranteed to yield linearly sized output
relations, existing solutions exhibit quadratic runtime complexity. They suffer
from redundant interval comparisons and additional joins for the formation of
lineage expressions. In this paper, we formally define the semantics of set
operations in TP databases and study their properties. For their efficient
computation, we introduce the lineage-aware temporal window, a mechanism that
directly binds intervals with lineage expressions. We suggest the lineage-aware
window advancer (LAWA) for producing the windows of two TP relations in
linearithmic time, and we implement all TP set operations based on LAWA. By
exploiting the flexibility of lineage-aware temporal windows, we perform direct
filtering of irrelevant intervals and finalization of output lineage
expressions and thus guarantee that no additional computational cost or buffer
space is needed. A series of experiments over both synthetic and real-world
datasets show that (a) our approach has predictable performance, depending only
on the input size and not on the number of time intervals per fact or their
overlap, and that (b) it outperforms state-of-the-art approaches in both
temporal and probabilistic databases.