Partitioner Selection with EASE to Optimize Distributed Graph Processing
For distributed graph processing on massive graphs, a graph is partitioned
into multiple equally-sized parts which are distributed among machines in a
compute cluster. In the last decade, many partitioning algorithms have been
developed which differ from each other with respect to the partitioning
quality, the run-time of the partitioning and the type of graph for which they
work best. The plethora of graph partitioning algorithms makes it a challenging
task to select a partitioner for a given scenario. Different studies exist that
provide qualitative insights into the characteristics of graph partitioning
algorithms that support a selection. However, in order to enable automatic
selection, a quantitative prediction of the partitioning quality, the
partitioning run-time and the run-time of subsequent graph processing jobs is
needed. In this paper, we propose a machine learning-based approach to provide
such a quantitative prediction for different types of edge partitioning
algorithms and graph processing workloads. We show that training based on
generated graphs achieves high accuracy, which can be further improved when
using real-world data. Based on the predictions, the automatic selection
reduces the end-to-end run-time on average by 11.1% compared to a random
selection, by 17.4% compared to selecting the partitioner that yields the
lowest cut size, and by 29.1% compared to the worst strategy, respectively.
Furthermore, in 35.7% of the cases, the best strategy was selected.
Comment: To appear at IEEE International Conference on Data Engineering (ICDE 2023)
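The selection step described in this abstract can be illustrated with a minimal sketch. It assumes that some predictor has already produced, for each candidate partitioner, an estimated partitioning run-time and an estimated run-time of the subsequent processing job, and it picks the candidate with the lowest predicted end-to-end run-time. The names `Prediction` and `select_partitioner` are hypothetical and are not taken from the paper; the trained models themselves are out of scope here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Prediction:
    """Predicted costs for one candidate partitioner (hypothetical structure)."""
    partitioner: str
    partitioning_seconds: float  # predicted run-time of the partitioning itself
    processing_seconds: float    # predicted run-time of the subsequent processing job

    @property
    def end_to_end_seconds(self) -> float:
        # End-to-end cost is partitioning time plus processing time.
        return self.partitioning_seconds + self.processing_seconds


def select_partitioner(predictions: Iterable[Prediction]) -> Prediction:
    """Return the candidate with the lowest predicted end-to-end run-time."""
    candidates = list(predictions)
    if not candidates:
        raise ValueError("at least one prediction is required")
    return min(candidates, key=lambda p: p.end_to_end_seconds)
```

Under this assumption, a partitioner that is slow to run can still be selected if the predicted savings in processing time outweigh its partitioning cost, which is why choosing by cut size alone may be suboptimal.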
An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing
In this paper, we study the problem of choosing among partitioning strategies in distributed graph processing systems. To this end, we evaluate and characterize both the performance and resource usage of different partitioning strategies under various popular distributed graph processing systems, applications, input graphs, and execution environments. Through our experiments, we found that no single partitioning strategy is the best fit for all situations, and that the choice of partitioning strategy has a significant effect on resource usage and application run-time. Our experiments demonstrate that the choice of partitioning strategy depends on (1) the degree distribution of the input graph, (2) the type and duration of the application, and (3) the cluster size. Based on our results, we present rules of thumb to help users pick the best partitioning strategy for their particular use cases. We present results from each system, as well as from all partitioning strategies implemented in one common system (PowerLyra).
Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1
We propose an efficient and scalable architecture for processing generalized
graph-pattern queries as they are specified by the current W3C recommendation
of the SPARQL 1.1 "Query Language" component. Specifically, the class of
queries we consider consists of sets of SPARQL triple patterns with labeled
property paths. From a relational perspective, this class resolves to
conjunctive queries of relational joins with additional graph-reachability
predicates. For the scalable, i.e., distributed, processing of such
queries over very large RDF collections, we develop a suitable partitioning and
indexing scheme, which allows us to shard the RDF triples over an entire
cluster of compute nodes and to process an incoming SPARQL query over all of
the relevant graph partitions (and thus compute nodes) in parallel. Unlike most
prior works in this field, we specifically aim at the unified optimization and
distributed processing of queries consisting of both relational joins and
graph-reachability predicates. All communication among the compute nodes is
established via a proprietary, asynchronous communication protocol based on the
Message Passing Interface
- …
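The query class named in the abstract above, a conjunction of a relational join and a graph-reachability predicate, can be illustrated with a small single-machine sketch. It evaluates a pattern of the form `?x p ?y . ?y q+ ?z` over an in-memory list of triples. The function names and the example predicates are hypothetical; this is not the partitioned, distributed architecture the abstract proposes, only an illustration of what such a query computes.

```python
from collections import defaultdict, deque


def reachable(edges, start):
    """Nodes reachable from `start` over one or more edges (q+ semantics)."""
    seen = set()
    queue = deque(edges.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(edges.get(node, ()))
    return seen


def evaluate(triples, join_predicate, path_predicate):
    """Answer { ?x join_predicate ?y . ?y path_predicate+ ?z } as (x, y, z) tuples."""
    # Adjacency index for the predicate used in the reachability part.
    edges = defaultdict(set)
    for s, p, o in triples:
        if p == path_predicate:
            edges[s].add(o)

    results = set()
    for s, p, o in triples:
        if p == join_predicate:
            # Join on ?y: the object of the triple pattern is the path's start.
            for z in reachable(edges, o):
                results.add((s, o, z))
    return results
```

In a distributed setting, the triples would be sharded across compute nodes, so both the join and the reachability traversal may need to cross partition boundaries; the sketch sidesteps that by keeping everything in one process.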