GYM: A Multiround Distributed Join Algorithm
Multiround algorithms are now commonly used in distributed data processing systems, yet the extent to which algorithms can benefit from running more rounds is not well understood. This paper answers this question for several rounds for the problem of computing the equijoin of n relations. Given any query Q with width w, intersection width iw, input size IN, output size OUT, and a cluster of machines with $M = \Omega(IN^{1/\epsilon})$ memory available per machine, where $\epsilon > 1$ and $w \ge 1$ are constants, we show that:
1. Q can be computed in $O(n)$ rounds with $O(n (IN^{w} + OUT)^2 / M)$ communication cost with high probability.
2. Q can be computed in $O(\log n)$ rounds with $O(n (IN^{\max(w, 3iw)} + OUT)^2 / M)$ communication cost with high probability.
Intersection width is a new notion we introduce for queries and generalized hypertree decompositions (GHDs) of queries that captures how connected the adjacent components of the GHDs are.
We achieve our first result by introducing a distributed and generalized version of Yannakakis's algorithm, called GYM. GYM takes as input any GHD of Q with width w and depth d, and computes Q in $O(d + \log n)$ rounds and $O(n (IN^{w} + OUT)^2 / M)$ communication cost. We achieve our second result by showing how to construct GHDs of Q with width $\max(w, 3iw)$ and depth $O(\log n)$. We describe another technique to construct GHDs with larger widths and lower depths, demonstrating other tradeoffs one can make between communication and the number of rounds.
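As a rough single-machine illustration of the semijoin idea behind Yannakakis's algorithm, the sketch below processes a chain of relations such as R(A,B), S(B,C), T(C,D). It is not GYM itself: there is no distribution, no GHD, and no round accounting. The function names and the dict-per-row representation are assumptions made for this example only.

```python
def shared_keys(r, s):
    """Attributes common to two relations (lists of dict rows)."""
    if not r or not s:
        return []
    return sorted(r[0].keys() & s[0].keys())


def semijoin(r, s, keys):
    """Keep the rows of r that match at least one row of s on `keys`."""
    seen = {tuple(row[k] for k in keys) for row in s}
    return [row for row in r if tuple(row[k] for k in keys) in seen]


def natural_join(r, s):
    """Nested-loop natural join of two lists of dict rows."""
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                out.append({**a, **b})
    return out


def yannakakis_chain(relations):
    """Semijoin-reduce a chain of relations, then join the reduced relations."""
    rels = [list(r) for r in relations]
    # Upward pass: each relation is filtered by its successor (leaf to root).
    for i in range(len(rels) - 2, -1, -1):
        rels[i] = semijoin(rels[i], rels[i + 1], shared_keys(rels[i], rels[i + 1]))
    # Downward pass: each relation is filtered by its predecessor (root to leaf).
    for i in range(1, len(rels)):
        rels[i] = semijoin(rels[i], rels[i - 1], shared_keys(rels[i], rels[i - 1]))
    # Join pass: every remaining row now contributes to at least one output row.
    result = rels[0]
    for r in rels[1:]:
        result = natural_join(result, r)
    return rels, result


R = [{"A": 1, "B": 1}, {"A": 2, "B": 2}]
S = [{"B": 1, "C": 1}, {"B": 2, "C": 9}]
T = [{"C": 1, "D": 7}]
reduced, joined = yannakakis_chain([R, S, T])
```

After the two semijoin passes no relation holds a dangling row, so the intermediate results of the final joins never exceed what the output requires.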
DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs
Subgraph listing is a fundamental problem in graph theory and has wide
applications in areas like sociology, chemistry, and social networks. Modern
graphs can usually be large-scale as well as highly dynamic, which challenges
the efficiency of existing subgraph listing algorithms. Recent works have shown
the benefits of partitioning and processing big graphs in a distributed system,
however, there is only few work targets subgraph listing on dynamic graphs in a
distributed environment. In this paper, we propose an efficient approach,
called Distributed and Dynamic Subgraph Listing (DDSL), which can incrementally
update the results instead of running from scratch. DDSL follows a general
distributed join framework. In this framework, we use a Neighbor-Preserved
storage for data graphs, which takes bounded extra space and supports dynamic
updating. After that, we propose a comprehensive cost model to estimate the I/O
cost of listing subgraphs. Then based on this cost model, we develop an
algorithm to find the optimal join tree for a given pattern. To handle dynamic
graphs, we propose an efficient left-deep join algorithm to incrementally
update the join results. Extensive experiments are conducted on real-world
datasets. The results show that DDSL outperforms existing methods in dealing
with both static and dynamic graphs in terms of response time.
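The idea of updating results incrementally rather than recomputing them can be illustrated on the simplest pattern, the triangle. The sketch below is not DDSL: it is a single-machine example with no join tree, cost model, or Neighbor-Preserved storage, and the class and method names are hypothetical.

```python
from collections import defaultdict


class IncrementalTriangles:
    """Maintains the set of triangles of an undirected graph under edge updates."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = set()

    def add_edge(self, u, v):
        """Insert edge (u, v) and return only the triangles it creates."""
        if u == v or v in self.adj[u]:
            return set()
        # Every common neighbour w closes a new triangle {u, v, w}.
        new = {tuple(sorted((u, v, w))) for w in self.adj[u] & self.adj[v]}
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.triangles |= new
        return new

    def remove_edge(self, u, v):
        """Delete edge (u, v) and return only the triangles it destroys."""
        if v not in self.adj[u]:
            return set()
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        gone = {tuple(sorted((u, v, w))) for w in self.adj[u] & self.adj[v]}
        self.triangles -= gone
        return gone


g = IncrementalTriangles()
g.add_edge(1, 2)
g.add_edge(2, 3)
created = g.add_edge(1, 3)
```

Each update touches only the neighbourhoods of the two endpoints, which is the property that makes incremental maintenance cheaper than listing from scratch.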
Three-Way Joins on MapReduce: An Experimental Study
We study three-way joins on MapReduce. Joins are very useful in a multitude
of applications from data integration and traversing social networks, to mining
graphs and automata-based constructions. However, joins are expensive, even for
moderate data sets; we need efficient algorithms to perform distributed
computation of joins using clusters of many machines. MapReduce has become an
increasingly popular distributed computing system and programming paradigm. We
consider a state-of-the-art MapReduce multi-way join algorithm by Afrati and
Ullman and show when it is appropriate for use on very large data sets. By
providing a detailed experimental study, we demonstrate that this algorithm
scales much better than what is suggested by the original paper. However, if
the join result needs to be summarized or aggregated, as opposed to being only
enumerated, then the aggregation step can be integrated into a cascade of
two-way joins, which makes the cascade more efficient than the other algorithm and thus
the preferred solution. Comment: 6 pages
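To make the comparison concrete, the sketch below simulates a one-round, share-based three-way join of R(A,B), S(B,C), T(C,D) next to a cascade of two-way joins. It is a simplified in-memory illustration of the replication idea, not the experimental setup of the paper. The function names, the use of modulo as the hash function, and the restriction to integer attribute values and duplicate-free relations are assumptions of this example.

```python
from collections import defaultdict
from itertools import product


def one_round_three_way(R, S, T, b_share, c_share):
    """Join on a b_share x c_share grid of reducers; return (result, tuples sent)."""
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})
    sent = 0
    for a, b in R:
        # R knows only B, so it is replicated across every C bucket.
        for j in range(c_share):
            reducers[(b % b_share, j)]["R"].append((a, b))
            sent += 1
    for b, c in S:
        # S knows both B and C, so it goes to exactly one reducer.
        reducers[(b % b_share, c % c_share)]["S"].append((b, c))
        sent += 1
    for c, d in T:
        # T knows only C, so it is replicated across every B bucket.
        for i in range(b_share):
            reducers[(i, c % c_share)]["T"].append((c, d))
            sent += 1
    out = set()
    for parts in reducers.values():
        for (a, b), (b2, c), (c2, d) in product(parts["R"], parts["S"], parts["T"]):
            if b == b2 and c == c2:
                out.add((a, b, c, d))
    return out, sent


def cascade(R, S, T):
    """Two-way join of R and S, followed by a two-way join with T."""
    rs = {(a, b, c) for a, b in R for b2, c in S if b == b2}
    return {(a, b, c, d) for a, b, c in rs for c2, d in T if c == c2}


R = [(1, 10), (2, 20)]
S = [(10, 100), (20, 200), (30, 300)]
T = [(100, 7), (200, 8), (200, 9)]
result, sent = one_round_three_way(R, S, T, b_share=2, c_share=3)
```

In this sketch the one-round join ships `len(R) * c_share + len(S) + len(T) * b_share` tuples, whereas the cascade instead has to materialize the intermediate R-S result between its two steps.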
Storage and Search in Dynamic Peer-to-Peer Networks
We study robust and efficient distributed algorithms for searching, storing,
and maintaining data in dynamic Peer-to-Peer (P2P) networks. P2P networks are
highly dynamic networks that experience heavy node churn (i.e., nodes join and
leave the network continuously over time). Our goal is to guarantee, despite
high node churn rate, that a large number of nodes in the network can store,
retrieve, and maintain a large number of data items. Our main contributions are
fast randomized distributed algorithms that guarantee the above with high
probability (whp) even under high adversarial churn:
1. A randomized distributed search algorithm that (whp) guarantees that
searches from as many as nodes ( is the stable network size)
succeed in -rounds despite churn, for
any small constant , per round. We assume that the churn is
controlled by an oblivious adversary (that has complete knowledge and control
of what nodes join and leave and at what time, but is oblivious to the random
choices made by the algorithm).
2. A storage and maintenance algorithm that guarantees (whp) data items can
be efficiently stored (with only copies of each data item)
and maintained in a dynamic P2P network with churn rate up to
per round. Our search algorithm together with our
storage and maintenance algorithm guarantees that as many as nodes
can efficiently store, maintain, and search even under churn per round. Our algorithms require only polylogarithmic in bits to
be processed and sent (per round) by each node.
To the best of our knowledge, our algorithms are the first-known,
fully-distributed storage and search algorithms that provably work under highly
dynamic settings (i.e., high churn rates per step). Comment: to appear at SPAA 201
Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster
The availability of a large number of processing nodes in a parallel and
distributed computing environment enables sophisticated real time processing
over high speed data streams, as required by many emerging applications.
Sliding window stream joins are among the most important operators in a stream
processing system. In this paper, we consider the issue of parallelizing a
sliding window stream join operator over a shared nothing cluster. We propose a
framework, based on a fixed or predefined communication pattern, to distribute
the join processing loads over the shared-nothing cluster. We consider various
overheads while scaling over a large number of nodes, and propose solution
methodologies to cope with the issues. We implement the algorithm over a
cluster using a message passing system, and present the experimental results
showing the effectiveness of the join processing algorithm. Comment: 11 pages
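For readers unfamiliar with the operator being parallelized, a sliding window stream join on a single node can be sketched as follows. This is not the paper's framework: it shows no distribution or communication pattern, and the class name, the time-based window, and the assumption that timestamps arrive in non-decreasing order are all choices made for this example.

```python
from collections import deque


class SlidingWindowJoin:
    """Equi-join of two streams, "L" and "R", over a time-based sliding window."""

    def __init__(self, window):
        self.window = window
        # Each buffer holds (timestamp, key, value) in arrival order.
        self.buffers = {"L": deque(), "R": deque()}

    def insert(self, side, ts, key, value):
        """Add a tuple to one stream and return the join results it produces."""
        other = "R" if side == "L" else "L"
        # Expire tuples that have slid out of the window on both sides.
        for buf in self.buffers.values():
            while buf and buf[0][0] <= ts - self.window:
                buf.popleft()
        # Probe the opposite window for matching keys.
        matches = [
            (value, v) if side == "L" else (v, value)
            for _, k, v in self.buffers[other]
            if k == key
        ]
        self.buffers[side].append((ts, key, value))
        return matches


join = SlidingWindowJoin(window=10)
first = join.insert("L", 0, "k", "l0")
second = join.insert("R", 5, "k", "r5")
third = join.insert("R", 12, "k", "r12")
fourth = join.insert("L", 13, "k", "l13")
```

Every arriving tuple both probes the opposite window and is stored in its own, which is why the per-tuple load grows with the window size and stream rate.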
An evaluation of a pipeline planner for a filter based algorithm.
Distributed query processing is one of the technical problems that need to be solved in Distributed Database Management Systems. Query processing deals with designing algorithms that analyze queries and convert them into a series of data manipulation operations. The problem is how to decide on a strategy for executing each query over the network in the most cost-effective way. Over the years, the research focus in distributed query processing has been on how to realize join operations with different operators such as Semi-join, Two-way Semijoin, and Pipeline N-Way joins. However, these operations are executed sequentially, which may increase the data transfer cost. A new algorithm, the filter-based pipeline N-way join algorithm, is presented to reduce data transfer cost. It uses the filter concept to keep the data access cost low. The algorithm has three phases. Phase One: use a Bloom filter to do a forward semijoin and build tuple connectors. Phase Two: do a backward semijoin and build the pipeline cache planner. Phase Three: send the pipeline cache planner to the query site. The main goal of the new algorithm is to reduce data transfer cost while maintaining the low I/O cost of the pipeline N-way join algorithm. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .C36. Source: Masters Abstracts International, Volume: 43-01, page: 0231. Adviser: Joan Morrissey. Thesis (M.Sc.)--University of Windsor (Canada), 2004
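The Bloom filter semijoin used in Phase One can be sketched in general terms as follows. This is an illustration of the generic technique, not the thesis's algorithm: tuple connectors and the pipeline cache planner are not modeled, and the class, the function name, and the filter parameters are assumptions of this example.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def bloom_semijoin(local_rows, key_index, remote_filter):
    """Keep the local rows whose join key may exist at the remote site."""
    return [row for row in local_rows if row[key_index] in remote_filter]


# Site 1 summarizes its join keys and ships only the filter, not the tuples.
site1_keys = [1, 2, 3]
bf = BloomFilter()
for key in site1_keys:
    bf.add(key)

# Site 2 uses the filter to drop rows that cannot join before sending any data.
site2_rows = [(1, "a"), (2, "b"), (99, "z")]
reduced = bloom_semijoin(site2_rows, 0, bf)
```

Because a Bloom filter can report false positives, the reduced relation may still contain a few non-joining rows, so a later exact join or backward semijoin is needed to remove them.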