18,730 research outputs found
A survey of parallel execution strategies for transitive closure and logic programs
An important feature of database technology of the nineties is the use of parallelism for speeding up the execution of complex queries. This technology is being tested in several experimental database architectures and a few commercial systems for conventional select-project-join queries. In particular, hash-based fragmentation is used to distribute data to disks under the control of different processors in order to perform selections and joins in parallel. With the development of new query languages, and in particular with the definition of transitive closure queries and of more general logic programming queries, the new dimension of recursion has been added to query processing. Recursive queries are complex; at the same time, their regular structure is particularly suited for parallel execution, and parallelism may give a high efficiency gain. We survey the approaches to parallel execution of recursive queries that have been presented in the recent literature. We observe that research on parallel execution of recursive queries is separated into two distinct subareas, one focused on the transitive closure of Relational Algebra expressions, the other one focused on optimization of more general Datalog queries. Though the subareas seem radically different because of the approach and formalism used, they have many common features. This is not surprising, because most typical Datalog queries can be solved by means of the transitive closure of simple algebraic expressions. We first analyze the relationship between the transitive closure of expressions in Relational Algebra and Datalog programs. We then review sequential methods for evaluating transitive closure, distinguishing iterative and direct methods. We address the parallelization of these methods, by discussing various forms of parallelization. Data fragmentation plays an important role in obtaining parallel execution; we describe hash-based and semantic fragmentation. Finally, we consider Datalog queries, and present general methods for parallel rule execution; we recognize the similarities between these methods and the methods reviewed previously, when the former are applied to linear Datalog queries. We also provide a quantitative analysis that shows the impact of the initial data distribution on the performance of methods
Tree Projections and Constraint Optimization Problems: Fixed-Parameter Tractability and Parallel Algorithms
Tree projections provide a unifying framework to deal with most structural
decomposition methods of constraint satisfaction problems (CSPs). Within this
framework, a CSP instance is decomposed into a number of sub-problems, called
views, whose solutions are either already available or can be computed
efficiently. The goal is to arrange portions of these views in a tree-like
structure, called tree projection, which determines an efficiently solvable CSP
instance equivalent to the original one. Deciding whether a tree projection
exists is NP-hard. Solution methods have therefore been proposed in the
literature that do not require a tree projection to be given, and that either
correctly decide whether the given CSP instance is satisfiable, or return that
a tree projection actually does not exist. These approaches had not been
generalized so far on CSP extensions for optimization problems, where the goal
is to compute a solution of maximum value/minimum cost. The paper fills the
gap, by exhibiting a fixed-parameter polynomial-time algorithm that either
disproves the existence of tree projections or computes an optimal solution,
with the parameter being the size of the expression of the objective function
to be optimized over all possible solutions (and not the size of the whole
constraint formula, used in related works). Tractability results are also
established for the problem of returning the best K solutions. Finally,
parallel algorithms for such optimization problems are proposed and analyzed.
Given that the classes of acyclic hypergraphs, hypergraphs of bounded
treewidth, and hypergraphs of bounded generalized hypertree width are all
covered as special cases of the tree projection framework, the results in this
paper directly apply to these classes. These classes are extensively considered
in the CSP setting, as well as in conjunctive database query evaluation and
optimization
Finite Boolean Algebras for Solid Geometry using Julia's Sparse Arrays
The goal of this paper is to introduce a new method in computer-aided
geometry of solid modeling. We put forth a novel algebraic technique to
evaluate any variadic expression between polyhedral d-solids (d = 2, 3) with
regularized operators of union, intersection, and difference, i.e., any CSG
tree. The result is obtained in three steps: first, by computing an independent
set of generators for the d-space partition induced by the input; then, by
reducing the solid expression to an equivalent logical formula between Boolean
terms made by zeros and ones; and, finally, by evaluating this expression using
bitwise operators. This method is implemented in Julia using sparse arrays. The
computational evaluation of every possible solid expression, usually denoted as
CSG (Constructive Solid Geometry), is reduced to an equivalent logical
expression of a finite set algebra over the cells of a space partition, and
solved by native bitwise operators.Comment: revised version submitted to Computer-Aided Geometric Desig
Four Lessons in Versatility or How Query Languages Adapt to the Web
Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: Some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation as application builders have to cope not only with one Web stack (e.g., XML technology) but with several ones, each of considerable complexity. With Xcerpt we have developed a rule- and pattern based query language that aims to give shield application builders from much of this complexity: In a single query language XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply for querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a “Web of Data”
An R*-Tree Based Semi-Dynamic Clustering Method for the Efficient Processing of Spatial Join in a Shared-Nothing Parallel Database System
The growing importance of geospatial databases has made it essential to perform complex spatial queries efficiently. To achieve acceptable performance levels, database systems have been increasingly required to make use of parallelism. The spatial join is a computationally expensive operator. Efficient implementation of the join operator is, thus, desirable. The work presented in this document attempts to improve the performance of spatial join queries by distributing the data set across several nodes of a cluster and executing queries across these nodes in parallel. This document discusses a new parallel algorithm that implements the spatial join in an efficient manner. This algorithm is compared to an existing parallel spatial-join algorithm, the clone join. Both algorithms have been implemented on a Beowulf cluster and compared using real datasets. An extensive experimental analysis reveals that the proposed algorithm exhibits superior performance both in declustering time as well as in the execution time of the join query
Instance and Output Optimal Parallel Algorithms for Acyclic Joins
Massively parallel join algorithms have received much attention in recent
years, while most prior work has focused on worst-optimal algorithms. However,
the worst-case optimality of these join algorithms relies on hard instances
having very large output sizes, which rarely appear in practice. A stronger
notion of optimality is {\em output-optimal}, which requires an algorithm to be
optimal within the class of all instances sharing the same input and output
size. An even stronger optimality is {\em instance-optimal}, i.e., the
algorithm is optimal on every single instance, but this may not always be
achievable.
In the traditional RAM model of computation, the classical Yannakakis
algorithm is instance-optimal on any acyclic join. But in the massively
parallel computation (MPC) model, the situation becomes much more complicated.
We first show that for the class of r-hierarchical joins, instance-optimality
can still be achieved in the MPC model. Then, we give a new MPC algorithm for
an arbitrary acyclic join with load O ({\IN \over p} + {\sqrt{\IN \cdot \OUT}
\over p}), where \IN,\OUT are the input and output sizes of the join, and
is the number of servers in the MPC model. This improves the MPC version of
the Yannakakis algorithm by an O (\sqrt{\OUT \over \IN} ) factor.
Furthermore, we show that this is output-optimal when \OUT = O(p \cdot \IN),
for every acyclic but non-r-hierarchical join. Finally, we give the first
output-sensitive lower bound for the triangle join in the MPC model, showing
that it is inherently more difficult than acyclic joins
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
- …