8,125 research outputs found

Gunrock: A High-Performance Graph Processing Library on the GPU

Author: Cederman D.
Goel A.
Gonzalez J. E.
Gregor D.
Jia Y.
Low Y.
Pande P. R.
Siek J. G.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 15/01/2016

For large-scale graph analytics on the GPU, the irregularity of data access and control flow and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We evaluate Gunrock on five key graph primitives and show that Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives, and better performance than any other GPU high-level graph library.

Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the previous version v5
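The frontier-centric, bulk-synchronous style described in this abstract can be illustrated with a small CPU-side sketch. This is not Gunrock's API: the function names `advance`, `filter_unvisited`, and `frontier_bfs` are hypothetical, the code is sequential Python rather than GPU code, and it only shows how a traversal can be phrased as repeated operations on a vertex frontier.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency list: vertex -> out-neighbors


def advance(graph: Graph, frontier: List[int]) -> List[int]:
    """Expand every frontier vertex to its out-neighbors (hypothetical helper)."""
    return [v for u in frontier for v in graph.get(u, [])]


def filter_unvisited(candidates: List[int], depth: Dict[int, int], level: int) -> List[int]:
    """Keep first-time visits only and label each with its BFS depth."""
    next_frontier = []
    for v in candidates:
        if v not in depth:
            depth[v] = level
            next_frontier.append(v)
    return next_frontier


def frontier_bfs(graph: Graph, source: int) -> Dict[int, int]:
    """Breadth-first search written as one advance + filter step per level."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:  # each iteration is one bulk-synchronous step
        level += 1
        frontier = filter_unvisited(advance(graph, frontier), depth, level)
    return depth
```

On a GPU, each step would presumably process the whole frontier in parallel; the sketch keeps only the data flow between steps.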

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Gunrock: GPU Graph Analytics

Author: Davidson Andrew
Liu Weitang
Osama Muhammad
Owens John D.
Pan Yuechao
Riffel Andy T.
Wang Leyuan
Wang Yangzihao
Wu Yuduo
Yang Carl
Yuan Chenshan
Publication date: 04/01/2017

For large-scale graph analytics on the GPU, the irregularity of data access and control flow and the complexity of programming GPUs have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives, from traversal-based and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.

Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

arXiv.org e-Print Archive

eScholarship - University of California

FigShare

Scalable Breadth-First Search on a GPU Cluster

Author: Owens John D.
Pan Yuechao
Pearce Roger
Publication date: 13/03/2018

On a GPU cluster, the high ratio of computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.

Comment: 12 pages, 13 figures. To appear at IPDPS 201
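The degree-separation idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the degree `threshold`, the `owner` map from vertex to GPU, and both function names are assumptions made for the example, and no real communication takes place.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, List[int]]  # adjacency list: vertex -> out-neighbors


def separate_by_degree(graph: Graph, threshold: int) -> Tuple[Set[int], Set[int]]:
    """Split vertices into high and low out-degree sets (threshold is hypothetical)."""
    high = {v for v, nbrs in graph.items() if len(nbrs) > threshold}
    low = set(graph) - high
    return high, low


def plan_communication(
    newly_visited: Set[int], high: Set[int], owner: Dict[int, int]
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Route updates: high-degree vertices go to a global reduction,
    low-degree vertices go point-to-point to the GPU that owns them."""
    reduce_set = sorted(v for v in newly_visited if v in high)
    point_to_point: Dict[int, List[int]] = {}
    for v in sorted(newly_visited - high):
        point_to_point.setdefault(owner[v], []).append(v)
    return reduce_set, point_to_point
```

The sketch only decides which channel each update would take; in a scale-free graph the high-degree set is small, which is what would keep a global reduction over it cheap.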

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Space Efficient Breadth-First and Level Traversals of Consistent Global States of Parallel Programs

Author: B Ganter
G Pruesse
G Steiner
KM Chandy
L Bianco
L Lamport
M Chein
M Habib
MM Sysło
S Alagar
T Ball
VK Garg
Publication date: 24/07/2017

Enumerating consistent global states of a computation is a fundamental problem in parallel computing, with applications to debugging, testing, and runtime verification of parallel programs. Breadth-first search (BFS) enumeration is especially useful for these applications because it finds an erroneous consistent global state with the fewest events possible. The total number of executed events in a global state is called its rank. BFS also allows enumeration of all global states of a given rank or within a range of ranks. If a computation on $n$ processes has $m$ events per process on average, then the traditional BFS (Cooper-Marzullo and its variants) requires $\mathcal{O}(\frac{m^{n-1}}{n})$ space in the worst case, whereas our algorithm requires $\mathcal{O}(m^2n^2)$ space. Thus, we reduce the space complexity for BFS enumeration of consistent global states exponentially and give the first polynomial-space algorithm for this task. In our experimental evaluation of seven benchmarks, traditional BFS fails in many cases by exhausting the 2 GB heap space allowed to the JVM. In contrast, our implementation uses less than 60 MB of memory and is also faster in many cases.
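To give a feel for the gap between the two space bounds stated in this abstract, here is a rough worked comparison. The values m = 10 and n = 10 are chosen only for illustration, and the constants hidden by the O-notation are ignored, so the figures indicate growth rates, not actual memory use.

```latex
\[
\frac{m^{n-1}}{n} = \frac{10^{9}}{10} = 10^{8},
\qquad
m^{2} n^{2} = 10^{2} \cdot 10^{2} = 10^{4},
\qquad
\frac{m^{n-1}/n}{m^{2} n^{2}} = \frac{10^{8}}{10^{4}} = 10^{4}.
\]
```

Adding one more process (n = 11) multiplies the first expression by roughly m = 10, while the second grows only by the factor (11/10)^2 = 1.21.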

arXiv.org e-Print Archive

Crossref

GPU peer-to-peer techniques applied to a cluster interconnect

Author: Ammendola Roberto
Bernaschi Massimo
Biagioni Andrea
Bisson Mauro
Cicero Francesca Lo
Fatica Massimiliano
Frezza Ottorino
Lonardo Alessandro
Mastrostefano Enrico
Paolucci Pier Stanislao
Rossetti Davide
Simula Francesco
Tosoratto Laura
Vicini Piero
Publication date: 31/07/2013

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, chiefly by avoiding staging to host memory, they require specific hardware features which are not available on current-generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. We also discuss the current software implementation, which integrates this feature by minimally extending the RDMA programming model, as well as some issues raised while employing it in a higher-level API such as MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they seem to benefit from the GPU peer-to-peer method.

Comment: paper accepted to CASS 201

arXiv.org e-Print Archive

Crossref

Maximum Multipath Routing Throughput in Multirate Wireless Mesh Networks

Author: Cai Jianfei
Foh Chuan Heng
Qureshi Jalaluddin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/06/2014

In this paper, we consider the problem of finding the maximum routing throughput between any pair of nodes in an arbitrary multirate wireless mesh network (WMN) using multiple paths. Multipath routing is an efficient technique to maximize routing throughput in a WMN; however, maximizing multipath routing throughput is an NP-complete problem because the wireless channel is a shared medium for electromagnetic wave transmission, which makes collision-free scheduling part of the optimization problem. In this work, we first provide a problem formulation that incorporates a collision-free schedule, and then, based on this formulation, we design an algorithm with search pruning that jointly optimizes paths and the transmission schedule. Though suboptimal, compared to the known optimal single-path flow, we demonstrate that an efficient multipath routing scheme can increase the routing throughput by up to 100% for simple WMNs.

Comment: This paper has been accepted for publication in IEEE 80th Vehicular Technology Conference, VTC-Fall 201

arXiv.org e-Print Archive

Crossref

Parametric shortest-path algorithms via tropical geometry

Author: Joswig Michael
Schröter Benjamin
Publication date: 07/07/2021

We study parameterized versions of classical algorithms for computing shortest-path trees. This is most easily expressed in terms of tropical geometry. Applications include shortest paths in traffic networks with variable link travel times.

Comment: 24 pages and 8 figure
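As background for the tropical phrasing, the sketch below computes shortest-path distances with min-plus (tropical) matrix products, where "addition" is `min` and "multiplication" is `+`. It covers only fixed numeric weights, not the parameterized weights the abstract studies, it assumes the graph has no negative cycles, and the function names are hypothetical.

```python
import math
from typing import List

INF = math.inf  # weight of a missing edge
Matrix = List[List[float]]


def min_plus_product(a: Matrix, b: Matrix) -> Matrix:
    """Tropical matrix product: entry (i, j) is min over k of a[i][k] + b[k][j]."""
    n = len(a)
    return [
        [min(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def shortest_path_matrix(weights: Matrix) -> Matrix:
    """All-pairs shortest distances by repeated tropical squaring.

    With zeros on the diagonal, the k-th tropical power holds shortest
    distances over paths of at most k edges, so squaring until k >= n - 1
    suffices when there are no negative cycles.
    """
    n = len(weights)
    result = [[0 if i == j else weights[i][j] for j in range(n)] for i in range(n)]
    steps = 1
    while steps < n - 1:
        result = min_plus_product(result, result)
        steps *= 2
    return result
```

For example, with edges 0→1 of weight 4, 0→2 of weight 1, and 2→1 of weight 2, the entry for the pair (0, 1) becomes 3, because the two-edge route through vertex 2 is shorter than the direct edge.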

arXiv.org e-Print Archive