Scalable Graph Convolutional Network Training on Distributed-Memory Systems
Graph Convolutional Networks (GCNs) are extensively utilized for deep
learning on graphs. The large data sizes of graphs and their vertex features
make scalable training algorithms and distributed memory systems necessary.
Since the convolution operation on graphs induces irregular memory access
patterns, designing a memory- and communication-efficient parallel algorithm
for GCN training poses unique challenges. We propose a highly parallel training
algorithm that scales to large processor counts. In our solution, the large
adjacency and vertex-feature matrices are partitioned among processors. We
exploit the vertex-partitioning of the graph to use non-blocking point-to-point
communication operations between processors for better scalability. To further
minimize the parallelization overheads, we introduce a sparse matrix
partitioning scheme based on a hypergraph partitioning model for full-batch
training. We also propose a novel stochastic hypergraph model to encode the
expected communication volume in mini-batch training. We show the merits of the
hypergraph model, previously unexplored for GCN training, over the standard
graph partitioning model which does not accurately encode the communication
costs. Experiments performed on real-world graph datasets demonstrate that the
proposed algorithms achieve considerable speedups over alternative solutions.
The savings in communication costs become even more pronounced at large
processor counts. The performance benefits are preserved in deeper GCNs with
more layers, as well as on billion-scale graphs.
Comment: To appear in PVLDB'2
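The core computation of each full-batch layer is a distributed sparse-times-dense product over the partitioned adjacency and feature matrices. Below is a minimal single-layer sketch of that communication pattern, assuming mpi4py and SciPy; the send/receive lists and the column ordering of the local adjacency block are hypothetical inputs that a real implementation would derive from the partitioning step, and this is not the paper's actual code.

```python
# A sketch of one full-batch GCN layer, sigma(A @ H @ W), under a 1-D row
# partition, assuming mpi4py. The send/recv lists are hypothetical inputs
# that would come from the (hyper)graph partitioning step; features are
# assumed float64 on both sides.
import numpy as np
from mpi4py import MPI

def gcn_layer(comm, A_local, H_local, W, send_ids, recv_counts):
    """A_local: this rank's block of rows of A (scipy CSR); H_local: owned
    feature rows; send_ids[r]: local rows that rank r needs;
    recv_counts[r]: number of remote rows expected from rank r."""
    reqs, inbox = [], {}
    # Post non-blocking receives for the remote feature rows we need.
    for r, n in recv_counts.items():
        inbox[r] = np.empty((n, H_local.shape[1]))
        reqs.append(comm.Irecv(inbox[r], source=r))
    # Non-blocking sends of the rows other ranks need from us; keep the
    # send buffers alive until Waitall completes.
    outbox = {r: np.ascontiguousarray(H_local[ids])
              for r, ids in send_ids.items()}
    for r, buf in outbox.items():
        reqs.append(comm.Isend(buf, dest=r))
    MPI.Request.Waitall(reqs)
    # Assumes A_local's columns are ordered as: owned rows first, then
    # received rows grouped by source rank in ascending order.
    H_full = np.vstack([H_local] + [inbox[r] for r in sorted(inbox)])
    return np.maximum(A_local @ H_full @ W, 0.0)  # ReLU activation
```

The point of the non-blocking point-to-point scheme over a collective all-gather is visible here: each processor exchanges only the feature rows its adjacency block actually references, which is exactly the volume the hypergraph partitioning model minimizes.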
Partitioner Selection with EASE to Optimize Distributed Graph Processing
For distributed graph processing on massive graphs, a graph is partitioned
into multiple equally-sized parts which are distributed among machines in a
compute cluster. In the last decade, many partitioning algorithms have been
developed which differ from each other with respect to the partitioning
quality, the run-time of the partitioning and the type of graph for which they
work best. The plethora of graph partitioning algorithms makes it a challenging
task to select a partitioner for a given scenario. Different studies exist that
provide qualitative insights into the characteristics of graph partitioning
algorithms that support a selection. However, in order to enable automatic
selection, a quantitative prediction of the partitioning quality, the
partitioning run-time and the run-time of subsequent graph processing jobs is
needed. In this paper, we propose a machine learning-based approach to provide
such a quantitative prediction for different types of edge partitioning
algorithms and graph processing workloads. We show that training based on
generated graphs achieves high accuracy, which can be further improved when
using real-world data. Based on the predictions, the automatic selection
reduces the end-to-end run-time on average by 11.1% compared to a random
selection, by 17.4% compared to selecting the partitioner that yields the
lowest cut size, and by 29.1% compared to the worst strategy. Furthermore, in
35.7% of the cases, the best strategy was selected.
Comment: To appear at IEEE International Conference on Data Engineering (ICDE 2023).
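The selection mechanism described here can be pictured as one learned cost model per candidate partitioner. Below is a hedged sketch using scikit-learn; the feature set, regressor choice, and candidate names are illustrative assumptions, not EASE's actual model.

```python
# A sketch of learned partitioner selection: one regressor per candidate
# maps cheap graph features to predicted end-to-end run-time (partitioning
# plus the subsequent processing job); selection is the argmin. Features
# and model class are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_cost_models(X, runtimes_by_partitioner):
    """X: one feature row per training graph, e.g. [|V|, |E|, density,
    degree skew]; runtimes_by_partitioner: {name: measured run-times}."""
    return {name: RandomForestRegressor(n_estimators=200).fit(X, y)
            for name, y in runtimes_by_partitioner.items()}

def select_partitioner(models, graph_features):
    """Return the candidate with the lowest predicted run-time."""
    x = np.asarray(graph_features, dtype=float).reshape(1, -1)
    return min(models, key=lambda name: models[name].predict(x)[0])
```

Training such models on generated graphs first and then refining them with real-world measurements mirrors the two-stage training result reported above.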
Optimal partitioning of directed acyclic graphs with dependent costs between clusters
Many statistical inference contexts, including Bayesian Networks (BNs),
Markov processes and Hidden Markov Models (HMMs), can be supported by
partitioning (i.e., mapping) the underlying Directed Acyclic Graph (DAG) into
clusters. However, optimal partitioning is challenging, especially in
statistical inference as the cost to be optimised is dependent on both nodes
within a cluster, and the mapping of clusters connected via parent and/or child
nodes, which we call dependent clusters. We propose a novel algorithm called
DCMAP for optimal cluster mapping with dependent clusters. Given an arbitrarily
defined, positive cost function based on the DAG and cluster mappings, we show
that DCMAP converges to find all optimal clusters, and returns near-optimal
solutions along the way. Empirically, we find that the algorithm is
time-efficient for a Dynamic Bayesian Network (DBN) model of a seagrass
complex system using a computation cost function. For 25- and 50-node DBNs,
despite enormous search spaces of possible cluster mappings, near-optimal
solutions with 88% and 72% similarity to the optimal solution were found at
iterations 170 and 865, respectively. The first optimal solutions were found
at iterations 934 and 2256, with costs that were 4% and 0.2% of the naive
heuristic cost, respectively.
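To make the notion of dependent clusters concrete, here is a toy cost function over a DAG cluster mapping in which the cost couples each cluster to the clusters of its parent and child nodes. It is an illustrative stand-in for the arbitrary positive cost function DCMAP optimises, not the algorithm itself.

```python
# An illustrative "dependent" cost over a DAG cluster mapping: cost depends
# both on the nodes inside each cluster and on how clusters connected via
# parent/child nodes are mapped. Toy stand-in, not DCMAP.
def mapping_cost(dag, mapping, node_cost, coupling_cost):
    """dag: {node: [children]}; mapping: {node: cluster id}."""
    within = sum(node_cost(v) for v in dag)
    across = sum(coupling_cost(mapping[u], mapping[v])
                 for u, children in dag.items() for v in children
                 if mapping[u] != mapping[v])
    return within + across

# Tiny example: a 4-node diamond DAG, two clusters, unit node costs and a
# fixed penalty for every edge that crosses clusters.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = mapping_cost(dag, {"a": 0, "b": 0, "c": 1, "d": 1},
                    node_cost=lambda v: 1.0,
                    coupling_cost=lambda cu, cv: 2.0)
print(cost)  # 4 node costs + 2 crossing edges * 2.0 = 8.0
```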
Play like a Vertex: A Stackelberg Game Approach for Streaming Graph Partitioning
In the realm of distributed systems tasked with managing and processing
large-scale graph-structured data, optimizing graph partitioning stands as a
pivotal challenge. The primary goal is to minimize communication overhead and
runtime cost. However, alongside the computational complexity associated with
optimal graph partitioning, a critical factor to consider is memory overhead.
Real-world graphs often reach colossal sizes, making it impractical and
economically unviable to load the entire graph into memory for partitioning.
This is also a fundamental premise of distributed graph processing, where a
single machine cannot accommodate the entire graph. Currently,
existing streaming partitioning algorithms exhibit a skew-oblivious nature,
yielding satisfactory partitioning results exclusively for specific graph
types. In this paper, we propose a novel streaming partitioning algorithm, the
Skewness-aware Vertex-cut Partitioner S5P, designed to leverage the skewness
characteristics of real graphs for achieving high-quality partitioning. S5P
offers high partitioning quality by segregating the graph's edge set into two
subsets, the head and tail sets. After being processed by a skewness-aware
clustering algorithm, the two subsets then undergo a Stackelberg graph game.
Our extensive evaluations conducted on large real-world and
synthetic graphs demonstrate that, in all instances, the partitioning quality
of S5P surpasses that of existing streaming partitioning algorithms, operating
within the same load balance constraints. For example, S5P can bring up to a
51% improvement in partitioning quality compared to the top partitioner among
the baselines. Lastly, we showcase that the implementation of S5P results in up
to an 81% reduction in communication cost and a 130% increase in runtime
efficiency for distributed graph processing tasks on PowerGraph.
Comment: This paper has been accepted by SIGMOD 202
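As a rough illustration of the head/tail idea, the sketch below splits a one-pass edge stream by a degree threshold: edges incident to high-degree ("head") vertices are placed greedily where their endpoints already have replicas, while low-degree ("tail") edges are hashed. The Stackelberg-game refinement and the clustering step of S5P are omitted; the threshold and tie-breaking rules are assumptions for illustration only.

```python
# A simplified skewness-aware streaming vertex-cut partitioner: head edges
# (high-degree endpoint) go greedily to a partition already replicating an
# endpoint, tail edges are hashed. Not S5P's actual game-based algorithm.
from collections import defaultdict

def stream_partition(edges, k, head_threshold, degrees):
    """edges: iterable of (u, v); degrees: precomputed vertex degrees."""
    replicas = defaultdict(set)        # vertex -> partitions holding it
    load = [0] * k
    for u, v in edges:
        if max(degrees[u], degrees[v]) >= head_threshold:
            # Head edge: prefer partitions that already replicate u or v,
            # breaking ties toward the least-loaded one.
            cands = (replicas[u] | replicas[v]) or set(range(k))
            p = min(cands, key=lambda i: load[i])
        else:
            p = hash((min(u, v), max(u, v))) % k   # tail edge: hashed
        replicas[u].add(p)
        replicas[v].add(p)
        load[p] += 1
        yield (u, v), p
```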
DKWS: A Distributed System for Keyword Search on Massive Graphs (Complete Version)
Due to the unstructuredness and the lack of schemas of graphs, such as
knowledge graphs, social networks, and RDF graphs, keyword search for querying
such graphs has been proposed. As graphs have become voluminous, large-scale
distributed processing has attracted much interest from the database research
community. While there have been several distributed systems, distributed
querying techniques for keyword search are still limited. This paper proposes a
novel distributed keyword search system called DKWS. First, we present a
monotonic property of keyword search algorithms that guarantees correct
parallelization. Second, we formulate a keyword search algorithm as monotonic
backward and forward search phases. Moreover, we propose new tight bounds for
pruning the nodes being searched. Third, we propose a notify-push paradigm
and the PINE programming model of DKWS. The notify-push paradigm allows the
workers and the coordinator in DKWS to exchange the upper bounds of matches
asynchronously. The PINE programming model naturally fits keyword search
algorithms, as they have distinguished phases, and allows preemptive searches
that mitigate staleness in a distributed system. Finally, we investigate the
performance and effectiveness of DKWS through experiments using real-world
datasets. We find that DKWS is up to two orders of magnitude faster than
related techniques, and its communication costs are considerably smaller than
those of other techniques.
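For intuition, the backward phase of keyword search can be sketched on a single machine as a multi-source BFS from the keyword hits over reversed edges: a candidate answer root is a vertex that can reach a hit of every keyword, scored by total distance. DKWS's forward phase, its tighter pruning bounds, and the distributed notify-push/PINE machinery are not reproduced in this sketch.

```python
# A single-machine sketch of the *backward* search phase of keyword search
# on a directed graph. rev_adj maps each vertex to its in-neighbours, so a
# BFS from the hits finds vertices that can reach them.
from collections import deque

def backward_search(rev_adj, hits_per_keyword):
    """hits_per_keyword: one set of hit vertices per query keyword."""
    dists = []
    for hits in hits_per_keyword:
        d, q = {v: 0 for v in hits}, deque(hits)
        while q:
            u = q.popleft()
            for w in rev_adj.get(u, ()):
                if w not in d:
                    d[w] = d[u] + 1
                    q.append(w)
        dists.append(d)
    # Candidate roots reached by every keyword, best (smallest) score first.
    roots = set.intersection(*(set(d) for d in dists))
    return sorted(roots, key=lambda r: sum(d[r] for d in dists))
```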
The Evolution of Distributed Systems for Graph Neural Networks and their Origin in Graph Processing and Deep Learning: A Survey
Graph Neural Networks (GNNs) are an emerging research field. This specialized
Deep Neural Network (DNN) architecture is capable of processing graph
structured data and bridges the gap between graph processing and Deep Learning
(DL). As graphs are everywhere, GNNs can be applied to various domains
including recommendation systems, computer vision, natural language processing,
biology and chemistry. With the rapidly growing size of real-world graphs,
the need for efficient and scalable GNN training solutions has emerged.
Consequently,
many works proposing GNN systems have emerged throughout the past few years.
However, there is an acute lack of overview, categorization and comparison of
such systems. We aim to fill this gap by summarizing and categorizing important
methods and techniques for large-scale GNN solutions. In addition, we establish
connections between GNN systems, graph processing systems and DL systems.
Comment: Accepted at ACM Computing Surveys.
Distributed Graph Neural Network Training: A Survey
Graph neural networks (GNNs) are a type of deep learning models that are
trained on graphs and have been successfully applied in various domains.
Despite the effectiveness of GNNs, it is still challenging for GNNs to
efficiently scale to large graphs. As a remedy, distributed computing has
become a promising solution for training large-scale GNNs, since it can
provide abundant computing resources. However, the dependencies induced by
the graph structure make high-efficiency distributed GNN training hard to
achieve, as it suffers from massive communication and workload imbalance. In
recent
years, many efforts have been made on distributed GNN training, and an array of
training algorithms and systems have been proposed. Yet, there is a lack of
systematic review on the optimization techniques for the distributed execution
of GNN training. In this survey, we analyze three major challenges in
distributed GNN training: massive feature communication, loss of model
accuracy, and workload imbalance. We then introduce a new taxonomy for the
optimization techniques in distributed GNN training that address these
challenges. The taxonomy classifies existing techniques into four categories:
GNN data partitioning, GNN batch generation, GNN execution model, and GNN
communication protocol. We carefully discuss the techniques in each category.
Finally, we summarize existing distributed GNN systems for multi-GPU,
GPU-cluster and CPU-cluster settings, respectively, and discuss future
directions for distributed GNN training.
Efficient Path Enumeration and Structural Clustering on Massive Graphs
Graph analysis plays a crucial role in understanding the relationships and structures within complex systems. This thesis focuses on addressing fundamental problems in graph analysis, including hop-constrained s-t simple path (HC-s-t path) enumeration, batch HC-s-t path query processing, and graph structural clustering (SCAN). The objective is to develop efficient and scalable distributed algorithms to tackle these challenges, particularly in the context of billion-scale graphs.
We first explore the problem of HC-s-t path enumeration. Existing solutions for this problem often suffer from inefficiency and scalability limitations, especially when dealing with billion-scale graphs. To overcome these drawbacks, we propose a novel hybrid search paradigm specifically tailored for HC-s-t path enumeration. This paradigm combines different search strategies to effectively explore the solution space. Building upon this paradigm, we devise a distributed enumeration algorithm that follows a divide-and-conquer strategy, incorporates fruitless exploration pruning, and optimizes memory consumption. Experimental evaluations on various datasets demonstrate that our algorithm achieves a significant speedup compared to existing solutions, even on datasets where they encounter out-of-memory issues.
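For reference, the single-machine core of HC-s-t path enumeration with fruitless-exploration pruning looks roughly as follows: a reverse BFS from t yields a per-vertex lower bound on the hops still needed, and any branch whose remaining budget cannot cover that bound is cut. The hybrid search paradigm and the distributed divide-and-conquer of the thesis are not reproduced in this sketch.

```python
# HC-s-t simple path enumeration with classic fruitless-exploration
# pruning: reverse BFS from t gives a hops-to-t lower bound per vertex.
from collections import deque

def hc_st_paths(adj, rev_adj, s, t, k):
    """Enumerate simple s-t paths with at most k hops."""
    dist_t, q = {t: 0}, deque([t])
    while q:
        u = q.popleft()
        for w in rev_adj.get(u, ()):
            if w not in dist_t:
                dist_t[w] = dist_t[u] + 1
                q.append(w)
    path, out = [s], []
    def dfs(u, budget):
        if u == t:
            out.append(list(path))
            return
        for w in adj.get(u, ()):
            # Prune if w cannot reach t within the remaining budget or
            # would repeat a vertex on the current path.
            if dist_t.get(w, k + 1) <= budget - 1 and w not in path:
                path.append(w)
                dfs(w, budget - 1)
                path.pop()
    dfs(s, k)
    return out
```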
Secondly, we address the problem of batch HC-s-t path query processing. In real-world scenarios, it is common to issue multiple HC-s-t path queries simultaneously and process them as a batch. However, existing solutions often focus on optimizing the processing performance of individual queries, disregarding the benefits of processing queries concurrently. To bridge this gap, we propose the concept of HC-s path queries, which captures the common computation among different queries. We design a two-phase HC-s path query detection algorithm to identify the shared computation for a given set of HC-s-t path queries. Based on the detected HC-s path queries, we develop an efficient HC-s-t path enumeration algorithm that effectively shares the common computation. Extensive experiments on diverse datasets validate the efficiency and scalability of our algorithm for processing multiple HC-s-t path queries concurrently.
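One simplified way to realise the shared computation, assuming the batch is grouped by a common source vertex, is a single truncated search from s run to the largest hop bound that records every simple path reaching any requested target; each query then filters by its own bound. The two-phase HC-s path detection algorithm above is more general than this grouping-by-source sketch.

```python
# Answering a batch of HC-s-t queries that share the source s with one
# search: run to the largest hop bound, record paths hitting any target,
# then filter per query. A simplification of the thesis's approach.
from collections import defaultdict

def batch_hc_queries(adj, s, queries_for_s):
    """queries_for_s: list of (t, k) pairs that share the source s."""
    k_max = max(k for _, k in queries_for_s)
    targets = {t for t, _ in queries_for_s}
    found = defaultdict(list)          # target -> simple paths from s
    path = [s]
    def dfs(u, budget):
        if u in targets:
            found[u].append(list(path))   # record the shared prefix
        if budget == 0:
            return
        for w in adj.get(u, ()):
            if w not in path:
                path.append(w)
                dfs(w, budget - 1)
                path.pop()
    dfs(s, k_max)
    # Each query keeps only the paths within its own hop bound.
    return {(s, t, k): [p for p in found[t] if len(p) - 1 <= k]
            for t, k in queries_for_s}
```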
Thirdly, we investigate the problem of graph structural clustering (SCAN) in billion-scale graphs. Existing distributed solutions for SCAN often lack efficiency or suffer from high memory consumption, making them impractical for large-scale graphs. To overcome these challenges, we propose a fine-grained clustering framework specifically tailored for SCAN. This framework enables effective identification of cohesive subgroups within a graph. Building upon this framework, we devise a distributed SCAN algorithm that minimizes communication overhead and reduces memory consumption throughout the execution. We also incorporate an effective workload balance mechanism that dynamically adjusts to handle skewed workloads. Experimental evaluations on real-world graphs demonstrate the efficiency and scalability of our proposed algorithm.
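The primitive such a framework evaluates at scale is SCAN's structural similarity over closed neighbourhoods, together with the epsilon/mu core test; a minimal single-machine version is shown below, with the distributed execution and workload balancing omitted.

```python
# SCAN's core primitives: structural similarity over closed
# neighbourhoods N[u] = N(u) + {u}, and the epsilon/mu core test.
import math

def structural_similarity(adj, u, v):
    """sigma(u, v) = |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|)."""
    nu, nv = set(adj[u]) | {u}, set(adj[v]) | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def is_core(adj, u, eps, mu):
    """u is a core if at least mu members of N[u] are eps-similar to u."""
    return sum(structural_similarity(adj, u, v) >= eps
               for v in set(adj[u]) | {u}) >= mu
```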
Overall, this thesis contributes novel distributed algorithms for HC-s-t path enumeration, batch HC-s-t path query processing, and graph structural clustering. The proposed algorithms address the efficiency and scalability challenges in graph analysis, particularly on billion-scale graphs. Extensive experimental evaluations validate the superiority of our algorithms over existing solutions, enabling efficient and scalable graph analysis in complex systems.