Load Balancing and Skew Resilience for Parallel Joins
We address the problem of load balancing for parallel joins. We show that the distributions of the input data received and the output data produced by worker machines are both important for performance. As a result, previous work, which optimizes for either input or output alone, is ineffective for load balancing. To that end, we propose a multi-stage load-balancing algorithm which considers the properties of both input and output data through sampling of the original join matrix. To do this efficiently, we propose a novel category of equi-weight histograms. To build them, we exploit state-of-the-art computational geometry algorithms for rectangle tiling. To our knowledge, we are the first to employ tiling algorithms for join load-balancing. In addition, we propose a novel, join-specialized tiling algorithm that has drastically lower time and space complexity than existing algorithms. Experiments show that our scheme outperforms state-of-the-art techniques by up to a factor of 15.
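The claim that input and output load both matter can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a plain equi-join, integer keys, and a hypothetical modulo routing rule, and the function name `worker_loads` is invented for illustration. It only shows that a partitioning with perfectly balanced input can still produce unbalanced output.

```python
from collections import Counter

def worker_loads(r_keys, s_keys, num_workers):
    """Per-worker (input, output) tuple counts for a hash-partitioned equi-join."""
    r_freq, s_freq = Counter(r_keys), Counter(s_keys)
    inputs = [0] * num_workers
    outputs = [0] * num_workers
    for key in set(r_freq) | set(s_freq):
        w = key % num_workers  # assumed routing rule: integer keys, modulo hashing
        inputs[w] += r_freq[key] + s_freq[key]   # tuples the worker receives
        outputs[w] += r_freq[key] * s_freq[key]  # join results it must produce
    return inputs, outputs

# Key 0 appears 5 times in each relation; key 1 appears 9 times in R, once in S.
r = [0] * 5 + [1] * 9
s = [0] * 5 + [1] * 1
inputs, outputs = worker_loads(r, s, num_workers=2)
print(inputs)   # [10, 10]
print(outputs)  # [25, 9]
```

Both workers receive 10 tuples, yet the first produces 25 result tuples and the second only 9, so balancing on input size alone would miss the imbalance.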
Scalable and Adaptive Online Joins
Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of state maintained and number of messages communicated. Previous research proposes static partitioning schemes that require statistics beforehand. In an online or streaming environment in which no statistics about the workload are known, traditional static approaches perform poorly. This paper presents a novel parallel online dataflow join operator that supports arbitrary join predicates. The proposed operator continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. The operator is resilient to data skew, maintains high throughput rates, avoids blocking behavior during state repartitioning, takes an eventual consistency approach for maintaining its local state, and behaves strongly consistently as a black-box dataflow operator. We prove that the operator ensures a constant competitive ratio of 3.75 in data distribution optimality and that the cost of processing an input tuple is amortized constant, taking into account adaptivity costs. Our evaluation demonstrates that our operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time.
Peregrine: A Pattern-Aware Graph Mining System
Graph mining workloads aim to extract structural properties of a graph by
exploring its subgraph structures. General purpose graph mining systems provide
a generic runtime to explore subgraph structures of interest with the help of
user-defined functions that guide the overall exploration process. However, the
state-of-the-art graph mining systems remain largely oblivious to the shape (or
pattern) of the subgraphs that they mine. This causes them to (a) explore
unnecessary subgraphs, (b) perform expensive computations on the explored
subgraphs, and (c) hold intermediate partial subgraphs in memory, all of which
degrade their overall performance. Furthermore, their programming models are
often tied to their underlying exploration strategies, which makes it difficult
for domain users to express complex mining tasks.
In this paper, we develop Peregrine, a pattern-aware graph mining system that
directly explores the subgraphs of interest while avoiding exploration of
unnecessary subgraphs, and simultaneously bypassing expensive computations
throughout the mining process. We design a pattern-based programming model that
treats "graph patterns" as first class constructs and enables Peregrine to
extract the semantics of patterns, which it uses to guide its exploration. Our
evaluation shows that Peregrine outperforms state-of-the-art distributed and
single machine graph mining systems, and scales to complex mining tasks on
larger graphs, while retaining simplicity and expressivity with its
"pattern-first" programming approach.
Comment: This is the full version of the paper appearing in the European
Conference on Computer Systems (EuroSys), 202
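The difference between pattern-oblivious and pattern-guided exploration can be sketched for the triangle pattern. This is an illustration only and does not reproduce Peregrine's programming model or runtime: the function names, the dict-of-sets graph representation, and the `u < v < w` ordering used to avoid duplicate matches are all assumptions made for this example.

```python
from itertools import combinations

def count_triangles_oblivious(adj):
    """Enumerate every 3-vertex set, then test whether it forms a triangle."""
    explored = matches = 0
    for a, b, c in combinations(sorted(adj), 3):
        explored += 1
        if b in adj[a] and c in adj[a] and c in adj[b]:
            matches += 1
    return matches, explored

def count_triangles_pattern_guided(adj):
    """Extend only along edges, keeping u < v < w so each triangle is found once."""
    explored = matches = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue
            # Only common neighbours of u and v can close the triangle.
            for w in adj[u] & adj[v]:
                if w > v:
                    explored += 1
                    matches += 1
    return matches, explored

# Small undirected graph: one triangle (0, 1, 2) plus a path 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(count_triangles_oblivious(adj))       # (1, 10)
print(count_triangles_pattern_guided(adj))  # (1, 1)
```

Both functions find the single triangle, but the oblivious version inspects all 10 three-vertex subsets while the guided version materializes only the one subgraph that matches.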