On Large-Scale Graph Generation with Validation of Diverse Triangle Statistics at Edges and Vertices
Researchers developing implementations of distributed graph analytic
algorithms require graph generators that yield graphs sharing the challenging
characteristics of real-world graphs (small-world, scale-free, heavy-tailed
degree distribution) with efficiently calculable ground-truth solutions to the
desired output. Reproducibility for current generators used in benchmarking is
somewhat lacking in this respect due to their randomness: the output of a
desired graph analytic can only be compared to expected values, not exact
ground truth. Nonstochastic Kronecker product graphs meet these design criteria
for several graph analytics. Here we show that many flavors of triangle
participation can be cheaply calculated while generating a Kronecker product
graph. Given two medium-sized scale-free graphs with adjacency matrices A and
B, their Kronecker product graph has adjacency matrix A ⊗ B. Such graphs are
highly compressible: their edges can be represented implicitly and built in a
distributed setting from small data structures, making them easy to share in
compressed form. Many interesting graph calculations have costly worst-case
complexity bounds, and these bounds are often reduced for Kronecker product
graphs when a Kronecker formula can be derived yielding the sought calculation
on A ⊗ B in terms of related calculations on A and B.
We focus on deriving formulas for triangle participation at vertices (a vector
storing the number of triangles every vertex is involved in) and triangle
participation at edges (a sparse matrix storing the number of triangles at
every edge).
Comment: 10 pages, 7 figures, IEEE IPDPS Graph Algorithms Building Blocks
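As a concrete illustration of such a Kronecker formula, the triangle participation of a Kronecker product graph factors into the participation of its factors, because (A ⊗ B)³ = A³ ⊗ B³ and the diagonal of a Kronecker product is the Kronecker product of the diagonals. The sketch below is illustrative only, not the paper's implementation, and the small example matrices are invented:

```python
import numpy as np

def tri_per_vertex(A):
    # Triangles each vertex participates in: diag(A^3) / 2 for a
    # simple undirected graph with a symmetric 0/1 adjacency matrix A.
    return np.diag(np.linalg.matrix_power(A, 3)) // 2

# Two small example graphs (invented for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])
B = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)  # the triangle K3

# Direct computation on the materialized Kronecker product graph.
K = np.kron(A, B)
direct = tri_per_vertex(K)

# Kronecker formula: since (A ⊗ B)^3 = A^3 ⊗ B^3, triangle participation
# of the product graph is recoverable from the factors alone, without
# ever materializing K.
formula = np.kron(np.diag(np.linalg.matrix_power(A, 3)),
                  np.diag(np.linalg.matrix_power(B, 3))) // 2

assert np.array_equal(direct, formula)
```

The formula-based path touches only the factor matrices, which is what makes ground-truth statistics cheap to compute alongside generation.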
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance implementations of graph algorithms are challenging to
implement on new parallel hardware such as GPUs because of three challenges:
(1) the difficulty of coming up with graph building blocks, (2) load imbalance
on parallel hardware, and (3) graph problems having low arithmetic intensity.
To address some of these challenges, GraphBLAS is an innovative, ongoing
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull directions.
Exploiting output sparsity allows users to tell the backend which values of the
output of a single vectorized computation they do not want computed.
Load-balancing is an important feature for balancing work amongst parallel
workers. We describe the important load-balancing features for handling graphs
with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
Comment: 50 pages, 14 figures, 14 tables
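To make the linear-algebra formulation and the output-sparsity idea concrete, here is a BFS written as repeated masked sparse matrix-vector products in SciPy. This is a sketch of the GraphBLAS-style formulation, not GraphBLAST's actual API, and the example graph is invented:

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(A, source):
    # BFS as iterated sparse matrix-vector products: each step multiplies
    # the adjacency matrix by the current frontier, then masks out vertices
    # already visited -- the "output sparsity" a backend can exploit by
    # never computing those entries at all.
    n = A.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    level = 0
    while frontier.any():
        levels[frontier] = level
        reached = (A.T @ frontier.astype(np.int8)) > 0  # advance one hop
        frontier = reached & (levels == -1)             # mask: skip visited
        level += 1
    return levels

# Undirected path graph 0-1-2-3 (invented example).
rows = [0, 1, 1, 2, 2, 3]
cols = [1, 0, 2, 1, 3, 2]
A = sp.csr_matrix((np.ones(6), (rows, cols)), shape=(4, 4))
```

Note that the same code expresses both push and pull traversal; choosing between them based on frontier density is exactly the kind of decision the paper argues the backend should make.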
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high-performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC); an extended version of the PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU"
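The frontier-centric abstraction can be sketched in a few lines of plain Python. The operator names `advance` and `filter_frontier` are illustrative of Gunrock's data-centric operators, not its actual API, and the example graph is invented:

```python
def advance(graph, frontier):
    # "Advance" operator: expand every vertex in the frontier to its
    # neighbors (duplicates allowed at this stage, as on the GPU).
    return [v for u in frontier for v in graph[u]]

def filter_frontier(candidates, visited):
    # "Filter" operator: drop duplicates and already-visited vertices,
    # producing the next frontier.
    out = []
    for v in candidates:
        if v not in visited:
            visited.add(v)
            out.append(v)
    return out

# Small undirected graph as an adjacency list (invented example).
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

frontier, visited, order = [0], {0}, [[0]]
while frontier:
    frontier = filter_frontier(advance(graph, frontier), visited)
    if frontier:
        order.append(frontier)
```

A bulk-synchronous primitive such as BFS is then just this loop; the programmer supplies only the per-edge and per-vertex logic, while the library handles load balancing inside each operator.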
Quickly Finding a Truss in a Haystack
The k-truss of a graph is a subgraph in which each edge is tightly connected to the remaining elements of the k-truss; the k-truss can also represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast to finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of these tend to be computationally expensive and do not scale well. Many algorithms are iterative and rerun static graph triangle counting in each iteration. In this work we present a novel algorithm for finding both the k-truss of the graph (for a given k) and the maximal k-truss, using a dynamic graph formulation. Our algorithm has two main benefits. 1) Unlike many algorithms that rerun static graph triangle counting after the removal of nonconforming edges, we use a new dynamic graph formulation that only requires updating the edges affected by the removal. As our updates are local, we do only a fraction of the work compared to other algorithms. 2) Our algorithm is extremely scalable and is able to detect deleted triangles concurrently, in contrast to past sequential approaches. While our algorithm is architecture independent, we show a CUDA-based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100X to 10000X faster than the Graph Challenge benchmark. Furthermore, our algorithm shows significant speedups, in some cases over 70X, over a recently developed sequential and highly optimized algorithm.
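For contrast with the dynamic formulation, the static peeling baseline can be sketched as follows. This is a simplified reference implementation, not the authors' GPU code, and the example graph is invented; an edge survives in the k-truss only while it participates in at least k-2 triangles:

```python
def k_truss(edges, k):
    # Static k-truss by peeling: repeatedly remove edges that participate
    # in fewer than k-2 triangles, until no edge changes. This is the
    # iterative baseline that dynamic, local-update formulations improve on.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                if u < v:
                    support = len(adj[u] & adj[v])  # triangles on edge (u, v)
                    if support < k - 2:
                        adj[u].discard(v)
                        adj[v].discard(u)
                        changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}

# K4 with a pendant edge (3, 4) -- invented example.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
```

Note how removing one edge can lower the support of its neighbors' edges; the dynamic approach rechecks only those affected edges instead of recounting triangles over the whole graph.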
Parallelizing Maximal Clique Enumeration on GPUs
We present a GPU solution for exact maximal clique enumeration (MCE) that
performs a search tree traversal following the Bron-Kerbosch algorithm. Prior
works on parallelizing MCE on GPUs perform a breadth-first traversal of the
tree, which has limited scalability because of the explosion in the number of
tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing
depth-first traversal of independent subtrees in parallel. Since MCE suffers
from high load imbalance and memory capacity requirements, we propose a worker
list for dynamic load balancing, as well as partial induced subgraphs and a
compact representation of excluded vertex sets to regulate memory consumption.
Our evaluation shows that our GPU implementation on a single GPU outperforms
the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x
(up to 16.7x), and scales efficiently to multiple GPUs. Our code has been
open-sourced to enable further research on accelerating MCE.
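The search the GPU implementation parallelizes is the classic Bron-Kerbosch recursion with pivoting. A minimal sequential sketch (the example graph is invented, and this is not the paper's code):

```python
def bron_kerbosch(R, P, X, adj, out):
    # R: current clique, P: candidate vertices, X: excluded vertices.
    # Each search-tree node either reports a maximal clique or branches
    # on the candidates not covered by a pivot vertex.
    if not P and not X:
        out.append(set(R))
        return
    # Pick the pivot with the most candidate neighbors to shrink branching.
    pivot = max(P | X, key=lambda u: len(adj[u] & P))
    for v in list(P - adj[pivot]):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}
        X = X | {v}

# Invented example graph with maximal cliques {0,1,2}, {1,2,3}, {3,4}.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
```

The top-level loop iterations are independent subtrees, which is precisely what the paper's depth-first GPU traversal distributes across workers via its worker list.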
GraphChallenge.org: Raising the Bar on Graph Analytic Performance
The rise of graph analytic systems has created a need for new ways to measure
and compare the capabilities of graph processing systems. The MIT/Amazon/IEEE
Graph Challenge has been developed to provide a well-defined community venue
for stimulating research and highlighting innovations in graph analysis
software, hardware, algorithms, and systems. GraphChallenge.org provides a wide
range of pre-parsed graph data sets, graph generators, mathematically defined
graph algorithms, example serial implementations in a variety of languages, and
specific metrics for measuring performance. Graph Challenge 2017 received 22
submissions by 111 authors from 36 organizations. The submissions highlighted
graph analytic innovations in hardware, software, algorithms, systems, and
visualization. These submissions produced many comparable performance
measurements that can be used for assessing the current state of the art of the
field. Numerous submissions implemented the triangle counting challenge,
resulting in over 350 distinct measurements. Analysis of these submissions
shows that their execution time is a strong function of the number of edges in
the graph, Ne, and is typically polynomial in Ne for large values of Ne.
Combining the model fits of the submissions presents a picture of the current
state of the art of graph analysis in terms of edges processed per second as a
function of graph size. These results are substantially faster than serial
implementations commonly used by many graph analysts and underscore the
importance of making these performance benefits
available to the broader community. Graph Challenge provides a clear picture of
current graph analysis systems and underscores the need for new innovations to
achieve high performance on very large graphs.
Comment: 7 pages, 6 figures; submitted to IEEE HPEC Graph Challenge. arXiv
admin note: text overlap with arXiv:1708.0686
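The triangle counting kernel at the heart of the challenge has a compact linear-algebra form, sketched here in SciPy (an illustration of the standard masked-multiply formulation, not any particular submission's code): each entry of (A·A) masked by A counts the 2-paths closed by an edge, and every triangle is counted six times:

```python
import numpy as np
import scipy.sparse as sp

def count_triangles(A):
    # A is the symmetric 0/1 adjacency matrix of a simple graph.
    # (A @ A) counts 2-paths between vertex pairs; masking elementwise
    # by A keeps only pairs that are also edges, i.e. closed triangles.
    C = (A @ A).multiply(A)
    # Each triangle is counted once per ordered pair of its three edges.
    return int(C.sum()) // 6

# K4 (invented example): every 3 of its 4 vertices form a triangle.
K4 = sp.csr_matrix(np.ones((4, 4)) - np.eye(4))
```

Because the work is dominated by the sparse matrix product, execution time tracks the number of edges, consistent with the scaling behavior observed across submissions.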
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much
attention from researchers in fields of multigrid methods and graph analysis.
Many optimization techniques have been developed for particular application
fields and computing architectures over the decades. The objective of this
paper is to
provide a structured and comprehensive overview of the research on SpGEMM.
Existing optimization techniques have been grouped into different categories
based on their target problems and architectures. Covered topics include SpGEMM
applications, size prediction of result matrix, matrix partitioning and load
balancing, result accumulating, and target architecture-oriented optimization.
The rationales of different algorithms in each category are analyzed, and a
wide range of SpGEMM algorithms are summarized. This survey reveals the latest
progress and research status of SpGEMM optimization from 1977 to 2019. In
addition, an experimental comparative study of existing implementations on CPU
and GPU is presented. Based on our findings, we highlight future research
directions and show how future studies can leverage our findings to encourage
better design and implementation.
Comment: 19 pages, 11 figures, 2 tables, 4 algorithms
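Many of the surveyed designs build on Gustavson's row-wise formulation, where the choice of accumulator data structure is central. A minimal sketch with a hash-map accumulator (the matrix representation and example values are invented for illustration):

```python
def spgemm(A_rows, B_rows):
    # Gustavson's row-wise SpGEMM. Each matrix is a list of rows, each row
    # a dict mapping column index to value. For every nonzero A[i, k],
    # row k of B is scaled and scattered into the accumulator for row i.
    C_rows = []
    for a_row in A_rows:
        acc = {}  # sparse accumulator: the "result accumulating" step
        for k, a_val in a_row.items():
            for j, b_val in B_rows[k].items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        C_rows.append(acc)
    return C_rows

# A = [[1, 0], [2, 3]],  B = [[0, 4], [5, 0]]  (invented example)
A_rows = [{0: 1.0}, {0: 2.0, 1: 3.0}]
B_rows = [{1: 4.0}, {0: 5.0}]
C_rows = spgemm(A_rows, B_rows)
```

The survey's categories map directly onto this loop: predicting the size of each `acc` up front, partitioning rows across workers for load balance, and replacing the hash map with sorted lists or dense scratch arrays on different architectures.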