4 research outputs found
Accelerating PageRank using Partition-Centric Processing
PageRank is a fundamental link analysis algorithm that also functions as a
key representative of the performance of Sparse Matrix-Vector (SpMV)
multiplication. The traditional PageRank implementation generates
fine-granularity random memory accesses, resulting in a large amount of wasteful DRAM
traffic and poor bandwidth utilization. In this paper, we present a novel
Partition-Centric Processing Methodology (PCPM) to compute PageRank that
drastically reduces the amount of DRAM communication while achieving high
sustained memory bandwidth. PCPM uses a partition-centric abstraction coupled
with the Gather-Apply-Scatter (GAS) programming model. By carefully examining
how a PCPM based implementation impacts communication characteristics of the
algorithm, we propose several system optimizations that improve the execution
time substantially. More specifically, we develop (1) a new data layout that
significantly reduces communication and random DRAM accesses, and (2) branch
avoidance mechanisms to eliminate unpredictable data-dependent branches.
We perform detailed analytical and experimental evaluation of our approach
using 6 large graphs and demonstrate an average 2.7x speedup in execution time
and 1.7x reduction in communication volume, compared to the state-of-the-art.
We also show that unlike other GAS based implementations, PCPM is able to
further reduce main memory traffic by taking advantage of intelligent node
labeling that enhances locality. Although we use PageRank as the target
application in this paper, our approach can be applied to generic SpMV
computation.
Comment: Added acknowledgments. In proceedings of USENIX ATC 201
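The partition-centric scatter/gather idea can be illustrated with a toy PageRank kernel. This is a minimal sketch, not the paper's implementation: the function name, the contiguous-range partitioning, and the per-partition edge bins are assumptions for illustration, and the paper's optimized data layout and branch-avoidance mechanisms are not reproduced.

```python
def pagerank_pcpm(n, edges, num_parts=2, d=0.85, iters=20):
    """Toy sketch of partition-centric PageRank (PCPM-style).

    Vertices are split into contiguous partitions. Edges are pre-binned
    by destination partition, so the gather phase applies one bin at a
    time and touches only a small, cache-friendly slice of the rank
    vector instead of scattering updates randomly across DRAM.
    Assumes no dangling vertices contribute rank (simplification).
    """
    part_size = (n + num_parts - 1) // num_parts
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    # Bin edges by destination partition once, up front.
    part_edges = [[] for _ in range(num_parts)]
    for u, v in edges:
        part_edges[v // part_size].append((u, v))
    pr = [1.0 / n] * n
    for _ in range(iters):
        # Scatter: one contribution value per source vertex.
        contrib = [pr[u] / out_deg[u] if out_deg[u] else 0.0 for u in range(n)]
        new = [(1 - d) / n] * n
        # Gather: process one partition's bin at a time (locality).
        for bin_ in part_edges:
            for u, v in bin_:
                new[v] += d * contrib[u]
        pr = new
    return pr
```

On a 4-cycle (0→1→2→3→0) the ranks converge to the uniform distribution, which makes the sketch easy to sanity-check.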
Optimizing Graph Processing and Preprocessing with Hardware Assisted Propagation Blocking
Extensive prior research has focused on alleviating the characteristic poor
cache locality of graph analytics workloads. However, graph pre-processing
tasks remain relatively unexplored. In many important scenarios, graph
pre-processing tasks can be as expensive as the downstream graph analytics
kernel. We observe that Propagation Blocking (PB), a software optimization
designed for SpMV kernels, generalizes to many graph analytics kernels as well
as common pre-processing tasks. In this work, we identify the lingering
inefficiencies of a PB execution on conventional multicores and propose
architecture support to eliminate PB's bottlenecks, further improving the
performance gains from PB. Our proposed architecture -- COBRA -- optimizes the
PB execution of both graph processing and pre-processing alike to provide
end-to-end speedups of up to 4.6x (3.5x on average).
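The software technique COBRA accelerates, Propagation Blocking, can be sketched for a plain SpMV. This is an illustrative sketch only: the function name and coordinate-format input are assumptions, and COBRA's architectural support is not modeled.

```python
def spmv_pb(n, entries, x, bin_bits=2):
    """Toy sketch of Propagation Blocking for y = A @ x.

    Phase 1 (binning): stream the matrix once, appending (row, value)
    contributions into bins keyed by the row's high bits -- every write
    to a bin is an append, so writes are sequential rather than random.
    Phase 2 (accumulation): drain one bin at a time; each bin updates
    only a small, cache-resident slice of y.
    """
    num_bins = (n >> bin_bits) + 1
    bins = [[] for _ in range(num_bins)]
    for r, c, a in entries:          # matrix in coordinate (COO) form
        bins[r >> bin_bits].append((r, a * x[c]))
    y = [0.0] * n
    for b in bins:
        for r, contrib in b:
            y[r] += contrib
    return y
```

The two-phase structure trades one extra streaming pass for the elimination of random writes, which is exactly the trade-off the paper's hardware support aims to make cheaper.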
A GraphBLAS Approach for Subgraph Counting
Subgraph counting aims to count the occurrences of a subgraph template T in a
given network G. The basic problem of computing structural properties such as
counting triangles and other subgraphs has found applications in diverse
domains. Recent biological, social, cybersecurity and sensor network
applications have motivated solving such problems on massive networks with
billions of vertices. The larger subgraph problem is known to be memory bounded
and computationally challenging to scale; the complexity grows both as a
function of T and G. In this paper, we study the non-induced tree subgraph
counting problem, propose a novel layered software-hardware co-design approach,
and implement a shared-memory multi-threaded algorithm that: 1) reduces the
complexity of the parallel color-coding algorithm by identifying and pruning
redundant graph traversals; 2) achieves a fully vectorized implementation built
on linear algebra kernels inspired by GraphBLAS, which significantly improves
cache usage and maximizes memory bandwidth utilization. Experiments show that
our implementation improves the overall performance over the state-of-the-art
work by orders of magnitude and up to 660x for subgraph templates with size
over 12 on a dual-socket Intel(R) Xeon(R) Platinum 8160 server. We believe our
approach using GraphBLAS with optimized sparse linear algebra can be applied to
other massive subgraph counting problems and emerging high-memory bandwidth
hardware architectures.Comment: 12 page
Accurate, Efficient and Scalable Training of Graph Neural Networks
Graph Neural Networks (GNNs) are powerful deep learning models to generate
node embeddings on graphs. When applying deep GNNs on large graphs, it is still
challenging to perform training in an efficient and scalable way. We propose a
novel parallel training framework. Through sampling small subgraphs as
minibatches, we reduce training workload by orders of magnitude compared with
state-of-the-art minibatch methods. We then parallelize the key computation
steps on tightly-coupled shared memory systems. For graph sampling, we exploit
parallelism within and across sampler instances, and propose an efficient data
structure supporting concurrent accesses from samplers. The parallel sampler
theoretically achieves near-linear speedup with respect to the number of processing
units. For feature propagation within subgraphs, we improve cache utilization
and reduce DRAM traffic by data partitioning. Our partitioning strategy is a
2-approximation for minimizing communication cost relative to the optimal
partitioning. We further develop a runtime scheduler to reorder the training
operations and adjust the minibatch subgraphs to improve parallel performance.
Finally, we generalize the above parallelization strategies to support multiple
types of GNN models and graph samplers. The proposed training outperforms the
state-of-the-art in scalability, efficiency and accuracy simultaneously. On a
40-core Xeon platform, we achieve 60x speedup (with AVX) in the sampling step
and 20x speedup in the feature propagation step, compared to the serial
implementation. Our algorithm enables fast training of deeper GNNs, as
demonstrated by an orders-of-magnitude speedup compared to the TensorFlow
implementation. We open-source our code at
https://github.com/GraphSAINT/GraphSAINT.
Comment: 43 pages, 8 figures. arXiv admin note: text overlap with
arXiv:1810.1189
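The subgraph-as-minibatch idea can be sketched with the simplest possible sampler. This is a hedged illustration only: the function name and the plain random-node sampler are assumptions; the paper's actual samplers (e.g. random-walk based), their concurrent data structures, and the normalization needed for unbiased training are not shown.

```python
import random

def sample_node_subgraph(adj, budget, rng=None):
    """Toy sketch of minibatch construction by subgraph sampling.

    Draw `budget` vertices, take the induced subgraph, and treat it as
    one training minibatch: feature propagation then runs only inside
    this small subgraph instead of expanding neighborhoods across the
    full graph, which is what keeps the training workload small.
    """
    rng = rng or random.Random(0)
    nodes = rng.sample(sorted(adj), min(budget, len(adj)))
    node_set = set(nodes)
    # Induced edges: both endpoints must fall inside the sample.
    sub_edges = [(u, v) for u in nodes for v in adj[u] if v in node_set]
    return nodes, sub_edges
```

Because every minibatch is a self-contained subgraph, independent sampler instances can run in parallel, which is the parallelism the paper exploits within and across samplers.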