Counting Triangles in Large Graphs on GPU
The clustering coefficient and the transitivity ratio are concepts often used
in network analysis, which creates a need for fast practical algorithms for
counting triangles in large graphs. Previous research in this area focused on
sequential algorithms, MapReduce parallelization, and fast approximations.
In this paper we propose a parallel triangle counting algorithm for CUDA GPUs.
We describe the implementation details necessary to achieve high performance
and present the experimental evaluation of our approach. Our algorithm achieves
a speedup of 8 to 15 times over the CPU implementation and is capable of finding 3.8
billion triangles in an 89-million-edge graph in less than 10 seconds on the
Nvidia Tesla C2050 GPU.
Comment: 2016 IEEE International Parallel and Distributed Processing Symposium
Workshops (IPDPSW)
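As a point of reference for the problem this abstract addresses, the following is a minimal sequential sketch of exact triangle counting by intersecting oriented neighbor sets. It is not the paper's CUDA algorithm; the function name `count_triangles` and the degree-based edge orientation are assumptions chosen for illustration, and vertex ids are assumed to be mutually comparable (for example, integers).

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    # Build neighbor sets, ignoring self-loops and duplicate edges.
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    # Rank vertices by (degree, id) and orient each edge from lower to higher rank.
    rank = {v: (len(ns), v) for v, ns in neighbors.items()}
    forward = {v: {w for w in ns if rank[w] > rank[v]}
               for v, ns in neighbors.items()}

    # Each triangle is found exactly once, at its lowest-ranked vertex u.
    total = 0
    for u, outs in forward.items():
        for v in outs:
            total += len(outs & forward[v])
    return total
```

Orienting edges by rank keeps every triangle from being counted more than once and bounds the size of the sets being intersected, which is the same concern a parallel implementation has to manage per thread.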
On Large-Scale Graph Generation with Validation of Diverse Triangle Statistics at Edges and Vertices
Researchers developing implementations of distributed graph analytic
algorithms require graph generators that yield graphs sharing the challenging
characteristics of real-world graphs (small-world, scale-free, heavy-tailed
degree distribution) with efficiently calculable ground-truth solutions to the
desired output. Reproducibility of the current generators used in benchmarking is
somewhat lacking in this respect due to their randomness: the output of a
desired graph analytic can only be compared to expected values and not exact
ground truth. Nonstochastic Kronecker product graphs meet these design criteria
for several graph analytics. Here we show that many flavors of triangle
participation can be cheaply calculated while generating a Kronecker product
graph. Given two medium-sized scale-free graphs with adjacency matrices $A$ and
$B$, their Kronecker product graph has adjacency matrix $C = A \otimes B$. Such
graphs are highly compressible: $|\mathcal{E}|$ edges are represented in $\mathcal{O}(|\mathcal{E}|^{1/2})$ memory and can be built in a distributed setting from
small data structures, making them easy to share in compressed form. Many
interesting graph calculations have worst-case complexity bounds $\mathcal{O}(|\mathcal{E}|^p)$, and often these are reduced to
$\mathcal{O}(|\mathcal{E}|^{p/2})$ for Kronecker product graphs when a Kronecker formula can be derived yielding
the sought calculation on $C$ in terms of related calculations on $A$ and $B$.
We focus on deriving formulas for triangle participation at vertices, $\mathbf{t}_C$, a vector storing the number of triangles that every vertex is involved
in, and triangle participation at edges, $\Delta_C$, a sparse matrix storing
the number of triangles at every edge.
Comment: 10 pages, 7 figures, IEEE IPDPS Graph Algorithms Building Blocks
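To make the idea of a Kronecker formula concrete, the sketch below checks one such identity on small dense matrices. For undirected graphs without self-loops, the number of triangles at a vertex is half the corresponding diagonal entry of the cubed adjacency matrix, and the mixed-product property of the Kronecker product then gives vertex triangle counts on the product equal to twice the Kronecker product of the factors' counts. This identity is derived here from those two facts rather than quoted from the paper, and the helper names (`kron`, `kron_vec`, `triangles_per_vertex`) are hypothetical.

```python
def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]


def kron_vec(x, y):
    """Kronecker product of two vectors, using the same index order as kron()."""
    return [a * b for a in x for b in y]


def triangles_per_vertex(M):
    """Triangles at each vertex of a loop-free, symmetric 0/1 adjacency matrix."""
    n = len(M)
    counts = []
    for v in range(n):
        nbrs = [u for u in range(n) if M[v][u]]
        # Ordered pairs of adjacent neighbors; each triangle at v appears twice.
        closed = sum(M[a][b] for a in nbrs for b in nbrs)
        counts.append(closed // 2)
    return counts


# Small factors: a triangle, and a triangle with one pendant vertex.
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
B = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

direct = triangles_per_vertex(kron(A, B))          # computed on the 12-vertex product
from_factors = [2 * t for t in kron_vec(triangles_per_vertex(A),
                                        triangles_per_vertex(B))]
```

In this sketch `direct` touches the full product matrix, while `from_factors` only ever inspects the two small factors, which is the kind of saving the abstract describes.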
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU"
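The bulk-synchronous, frontier-centric style described in this abstract can be illustrated with a level-synchronous breadth-first search. The sketch below is plain sequential Python and does not use Gunrock's API; the function name `bfs_frontier` and the labels "advance" and "filter" in the comments are used here only to mark the two phases of each iteration.

```python
def bfs_frontier(adj, source):
    """Level-synchronous BFS over an adjacency dict {vertex: [neighbors]}."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        # Advance: expand every frontier vertex to its neighbors.
        # On a GPU this expansion is the data-parallel step.
        candidates = [w for v in frontier for w in adj[v]]
        # Filter: keep each unvisited vertex once to form the next frontier.
        next_frontier = []
        for w in candidates:
            if w not in depth:
                depth[w] = level
                next_frontier.append(w)
        frontier = next_frontier
    return depth
```

Each loop iteration is one bulk-synchronous step: all work for the current frontier finishes before the next frontier is consumed.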
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid
applications using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the application-level
programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas: partial differential equations, graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance implementations of graph algorithms are challenging to
implement on new parallel hardware such as GPUs because of three challenges:
(1) the difficulty of coming up with graph building blocks, (2) load imbalance
on parallel hardware, and (3) graph problems having low arithmetic intensity.
To address some of these challenges, GraphBLAS is an innovative, on-going
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull direction.
Exploiting output sparsity allows users to tell the backend which values of the
output in a single vectorized computation they do not want computed.
Load-balancing is an important feature for balancing work amongst parallel
workers. We describe the important load-balancing features for handling graphs
with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
Comment: 50 pages, 14 figures, 14 tables
Quickly Finding a Truss in a Haystack
The k-truss of a graph is a subgraph such that each edge is tightly connected to the remaining elements in the k-truss. The k-truss of a graph can also represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast to finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of these tend to be computationally expensive and do not scale well. Many algorithms are iterative and rerun static graph triangle counting in each iteration. In this work we present a novel algorithm for finding both the k-truss of the graph (for a given k) and the maximal k-truss, using a dynamic graph formulation. Our algorithm has two main benefits. 1) Unlike many algorithms that rerun the static graph triangle counting after the removal of nonconforming edges, we use a new dynamic graph formulation that only requires updating the edges affected by the removal. As our updates are local, we only do a fraction of the work compared to the other algorithms. 2) Our algorithm is extremely scalable and is able to concurrently detect deleted triangles, in contrast to past sequential approaches. While our algorithm is architecture independent, we show a CUDA-based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100X to 10000X faster than the Graph Challenge benchmark. Furthermore, our algorithm shows significant speedups, in some cases over 70X, over a recently developed sequential and highly optimized algorithm.
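The local-update idea in this abstract can be sketched with a simple sequential edge-peeling routine, assuming the common definition in which every edge of the k-truss lies in at least k-2 triangles of the k-truss. This is not the paper's dynamic-graph or CUDA algorithm; the function name `k_truss` and the queue-based peeling are assumptions made for illustration, and vertex ids are assumed to be mutually comparable.

```python
from collections import deque


def k_truss(edges, k):
    """Return the edges of the k-truss as a set of (low, high) vertex pairs."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    def key(u, v):
        return (u, v) if u < v else (v, u)

    # Support of an edge = number of triangles that contain it.
    support = {key(u, v): len(adj[u] & adj[v])
               for u in adj for v in adj[u] if u < v}

    queue = deque(e for e, s in support.items() if s < k - 2)
    queued = set(queue)
    while queue:
        u, v = queue.popleft()
        # Removing (u, v) destroys one triangle per common neighbor w, so only
        # the edges (u, w) and (v, w) need their support updated.
        for w in adj[u] & adj[v]:
            for e in (key(u, w), key(v, w)):
                support[e] -= 1
                if support[e] < k - 2 and e not in queued:
                    queue.append(e)
                    queued.add(e)
        adj[u].discard(v)
        adj[v].discard(u)
        del support[(u, v)]
    return set(support)
```

Triangle counts are computed once up front; after that, each removal only decrements the supports of edges that shared a triangle with the removed edge, rather than recounting the whole graph.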
A minimalistic approach for fast computation of geodesic distances on triangular meshes
The computation of geodesic distances is an important research topic in
Geometry Processing and 3D Shape Analysis as it is a basic component of many
methods used in these areas. In this work, we present a minimalistic parallel
algorithm based on front propagation to compute approximate geodesic distances
on meshes. Our method is practical and simple to implement and does not require
any heavy pre-processing. The convergence of our algorithm depends on the
number of discrete level sets around the source points from which distance
information propagates. To appropriately implement our method on GPUs taking
into account memory coalescence problems, we take advantage of a graph
representation based on a breadth-first search traversal that works
harmoniously with our parallel front propagation approach. We report
experiments that show how our method scales with the size of the problem. We
compare the mean error and processing time obtained by our method with such
measures computed using other methods. Our method produces results in
competitive times with almost the same accuracy, especially for large meshes.
We also demonstrate its use for solving two classical geometry processing
problems: the regular sampling problem and the Voronoi tessellation on meshes.
Comment: Preprint submitted to Computers & Graphics
Distributed Estimation of Graph 4-Profiles
We present a novel distributed algorithm for counting all four-node induced
subgraphs in a big graph. These counts, called the 4-profile, describe a
graph's connectivity properties and have found several uses ranging from
bioinformatics to spam detection. We also study the more complicated problem of
estimating the local 4-profiles centered at each vertex of the graph. The
local 4-profile embeds every vertex in an 11-dimensional space that
characterizes the local geometry of its neighborhood: vertices that connect
different clusters will have different local 4-profiles compared to those
that are only part of one dense cluster.
Our algorithm is a local, distributed message-passing scheme on the graph and
computes all the local 4-profiles in parallel. We rely on two novel
theoretical contributions: we show that local 4-profiles can be calculated
using compressed two-hop information and also establish novel concentration
results that show that graphs can be substantially sparsified and still retain
good approximation quality for the global 4-profile.
We empirically evaluate our algorithm using a distributed GraphLab
implementation that we scaled up to 640 cores. We show that our algorithm can
compute global and local 4-profiles of graphs with millions of edges in a few
minutes, significantly improving upon the previous state of the art.
Comment: To appear in part at WWW'1