Search CORE

1,310 research outputs found

Counting Triangles in Large Graphs on GPU

Author: Polak Adam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast practical algorithms for counting triangles in large graphs. Previous research in this area focused on sequential algorithms, MapReduce parallelization, and fast approximations. In this paper we propose a parallel triangle counting algorithm for CUDA GPU. We describe the implementation details necessary to achieve high performance and present the experimental evaluation of our approach. Our algorithm achieves 8 to 15 times speedup over the CPU implementation and is capable of finding 3.8 billion triangles in an 89 million edges graph in less than 10 seconds on the Nvidia Tesla C2050 GPU.Comment: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW

arXiv.org e-Print Archive

Crossref

Jagiellonian Univeristy Repository

On Large-Scale Graph Generation with Validation of Diverse Triangle Statistics at Edges and Vertices

Author: Kepner Jeremy
La Fond Timothy
Pearce Roger
Sanders Geoffrey
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/03/2018
Field of study

Researchers developing implementations of distributed graph analytic algorithms require graph generators that yield graphs sharing the challenging characteristics of real-world graphs (small-world, scale-free, heavy-tailed degree distribution) with efficiently calculable ground-truth solutions to the desired output. Reproducibility for current generators used in benchmarking are somewhat lacking in this respect due to their randomness: the output of a desired graph analytic can only be compared to expected values and not exact ground truth. Nonstochastic Kronecker product graphs meet these design criteria for several graph analytics. Here we show that many flavors of triangle participation can be cheaply calculated while generating a Kronecker product graph. Given two medium-sized scale-free graphs with adjacency matrices

A

and

B

, their Kronecker product graph has adjacency matrix

C = A \otimes B

. Such graphs are highly compressible:

|{\cal E}|

edges are represented in

{\cal O}(|{\cal E}|^{1/2})

memory and can be built in a distributed setting from small data structures, making them easy to share in compressed form. Many interesting graph calculations have worst-case complexity bounds

{\cal O}(|{\cal E}|^p)

and often these are reduced to

{\cal O}(|{\cal E}|^{p/2})

for Kronecker product graphs, when a Kronecker formula can be derived yielding the sought calculation on

C

in terms of related calculations on

A

and

B

. We focus on deriving formulas for triangle participation at vertices,

{\bf t}_C

, a vector storing the number of triangles that every vertex is involved in, and triangle participation at edges,

\Delta_C

, a sparse matrix storing the number of triangles at every edge.Comment: 10 pages, 7 figures, IEEE IPDPS Graph Algorithms Building Block

arXiv.org e-Print Archive

Crossref

Gunrock: GPU Graph Analytics

Author: Davidson Andrew
Liu Weitang
Osama Muhammad
Owens John D.
Pan Yuechao
Riffel Andy T.
Wang Leyuan
Wang Yangzihao
Wu Yuduo
Yang Carl
Yuan Chenshan
Publication venue
Publication date: 04/01/2017
Field of study

For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

arXiv.org e-Print Archive

eScholarship - University of California

FigShare

Mixing multi-core CPUs and GPUs for scientific simulation software

Author: Hawick K.A.
Leist A.
Playne D.P.
Publication venue: 'Massey University'
Publication date: 01/01/2010
Field of study

Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

Massey Research Online

GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Author: Buluc Aydin
Owens John D.
Yang Carl
Publication venue
Publication date: 14/11/2020
Field of study

High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.Comment: 50 pages, 14 figures, 14 table

arXiv.org e-Print Archive

eScholarship - University of California

Quickly Finding a Truss in a Haystack

Author: Bader David
Bombieri Nicola
Busato Federico
Fox James
Green Oded
Kannan Rajgopal
Kim Euna
Lakhotia Kartik
Prasanna Viktor
Singapura Shreyas
Zeng Hanqing
Zhou Shijie
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

The k-truss of a graph is a subgraph such that each edge is tightly connected to the remaining elements in the k-truss. The k-truss of a graph can also represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of these tend to be computationally expensive and do not scale well. Many algorithms are iterative and use static graph triangle counting in each iteration of the graph. In this work we present a novel algorithm for finding both the k- truss of the graph (for a given k), as well as the maximal k-truss using a dynamic graph formulation. Our algorithm has two main benefits. 1) Unlike many algorithms that rerun the static graph triangle counting after the removal of nonconforming edges, we use a new dynamic graph formulation that only requires updating the edges affected by the removal. As our updates are local, we only do a fraction of the work compared to the other algorithms. 2) Our algorithm is extremely scalable and is able to concurrently detect deleted triangles in contrast to past sequential approaches. While our algorithm is architecture independent, we show a CUDA based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100X-10000X faster than the Graph Challenge benchmark. Furthermore, our algorithm shows significant speedups, in some cases over 70X, over a recently developed sequential and highly optimized algorithm

Crossref

Catalogo dei prodotti della ricerca

A minimalistic approach for fast computation of geodesic distances on triangular meshes

Author: Calla Luciano A. Romero
Montenegro Anselmo A.
Perez Lizeth J. Fuentes
Publication venue: 'Elsevier BV'
Publication date: 23/08/2019
Field of study

The computation of geodesic distances is an important research topic in Geometry Processing and 3D Shape Analysis as it is a basic component of many methods used in these areas. In this work, we present a minimalistic parallel algorithm based on front propagation to compute approximate geodesic distances on meshes. Our method is practical and simple to implement and does not require any heavy pre-processing. The convergence of our algorithm depends on the number of discrete level sets around the source points from which distance information propagates. To appropriately implement our method on GPUs taking into account memory coalescence problems, we take advantage of a graph representation based on a breadth-first search traversal that works harmoniously with our parallel front propagation approach. We report experiments that show how our method scales with the size of the problem. We compare the mean error and processing time obtained by our method with such measures computed using other methods. Our method produces results in competitive times with almost the same accuracy, especially for large meshes. We also demonstrate its use for solving two classical geometry processing problems: the regular sampling problem and the Voronoi tessellation on meshes.Comment: Preprint submitted to Computers & Graphic

arXiv.org e-Print Archive

ZORA

Distributed Estimation of Graph 4-Profiles

Author: Borokhovich Michael
Dimakis Alexandros G.
Elenberg Ethan R.
Shanmugam Karthikeyan
Publication venue
Publication date: 04/04/2016
Field of study

We present a novel distributed algorithm for counting all four-node induced subgraphs in a big graph. These counts, called the

4

-profile, describe a graph's connectivity properties and have found several uses ranging from bioinformatics to spam detection. We also study the more complicated problem of estimating the local

4

-profiles centered at each vertex of the graph. The local

4

-profile embeds every vertex in an

11

-dimensional space that characterizes the local geometry of its neighborhood: vertices that connect different clusters will have different local

4

-profiles compared to those that are only part of one dense cluster. Our algorithm is a local, distributed message-passing scheme on the graph and computes all the local

4

-profiles in parallel. We rely on two novel theoretical contributions: we show that local

4

-profiles can be calculated using compressed two-hop information and also establish novel concentration results that show that graphs can be substantially sparsified and still retain good approximation quality for the global

4

-profile. We empirically evaluate our algorithm using a distributed GraphLab implementation that we scaled up to

640

cores. We show that our algorithm can compute global and local

4

-profiles of graphs with millions of edges in a few minutes, significantly improving upon the previous state of the art.Comment: To appear in part at WWW'1

arXiv.org e-Print Archive

Crossref