2,055 research outputs found
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the art graph framework.We have
four key observations: 1 Different graph applications require distinctive
numbers of threads to reach the peak performance. For the same application,
various datasets need even different numbers of threads to achieve the best
performance. 2 Only a few graph applications benefit from the high bandwidth
MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units
executing AVX512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. 4 The sub-NUMA cache clustering mode offering
the lowest local memory access latency hurts the performance of graph
benchmarks that are lack of NUMA awareness. At last, We suggest future works
including system auto-tuning tools and graph framework optimizations to fully
exploit the potential of KNL for parallel graph processing.Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs have been two
significant challenges for developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We evaluate Gunrock on five key graph
primitives and show that Gunrock has on average at least an order of magnitude
speedup over Boost and PowerGraph, comparable performance to the fastest GPU
hardwired primitives, and better performance than any other GPU high-level
graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the
previous version v5
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing
accelerator. GRAPHR follows the principle of near-data processing and explores
the opportunity of performing massive parallel analog operations with low
hardware and energy cost. The analog computation is suit- able for graph
processing because: 1) The algorithms are iterative and could inherently
tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and
Collaborative Filtering) and typical graph algorithms involving integers (e.g.,
BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a
vertex program of a graph algorithm can be expressed in sparse matrix vector
multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We
show that this assumption is generally true for a large set of graph
algorithms. GRAPHR is a novel accelerator architecture consisting of two
components: memory ReRAM and graph engine (GE). The core graph computations are
performed in sparse matrix format in GEs (ReRAM crossbars). The
vector/matrix-based graph computation is not new, but ReRAM offers the unique
opportunity to realize the massive parallelism with unprecedented energy
efficiency and low hardware cost. With small subgraphs processed by GEs, the
gain of performing parallel operations overshadows the wastes due to sparsity.
The experiment results show that GRAPHR achieves a 16.01x (up to 132.67x)
speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline
system. Com- pared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes
4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is
3.67x to 10.96x more energy efficiency compared to PIM-based architecture.Comment: Accepted to HPCA 201
Document Classification Systems in Heterogeneous Computing Environments
Datacenter workloads demand high throughput, low cost and power efficient solutions. In most data centers the operating costs dominates the infrastructure cost. The ever growing amounts of data and the critical need for higher throughput, more energy efficient document classification solutions motivated us to investigate alternatives to the traditional homogeneous CPU based implementations of document classification systems. Several heterogeneous systems were investigated in the past where CPUs were combined with GPUs and FPGAs as system accelerators. The increasing complexity of FPGAs made them an interesting device in the heterogeneous computing environments and on the other hand difficult to program using Hardware Description languages. We explore the trade-offs when using high level synthesis and low level synthesis when programming FPGAs. Using low level synthesis results in less hardware resource usage on FPGAs and also offers the higher throughput compared to using HLS tool. While using HLS tool different heterogeneous computing devices such as multicore CPU and GPU targeted. Through our implementation experience and empirical results for data centric applications, we conclude that we can achieve power efficient results for these set of applications by either using low level synthesis or high level synthesis for programming FPGAs
Characterizing Power and Energy Efficiency of Legion Data-Centric Runtime and Applications on Heterogeneous High-Performance Computing Systems
The traditional parallel programming models require programmers to explicitly specify parallelism and data movement of the underlying parallel mechanisms. Different from the traditional computation-centric programming, Legion provides a data-centric programming model for extracting parallelism and data movement. In this chapter, we aim to characterize the power and energy consumption of running HPC applications on Legion. We run benchmark applications on compute nodes equipped with both CPU and GPU, and measure the execution time, power consumption and CPU/GPU utilization. Additionally, we test the message passing interface (MPI) version of these applications and compare the performance and power consumption of high-performance computing (HPC) applications using the computation-centric and data-centric programming models. Experimental results indicate Legion applications outperforms MPI applications on both performance and energy efficiency, i.e., Legion applications can be 9.17 times as fast as MPI applications and use only 9.2% energy. Legion effectively explores the heterogeneous architecture and runs applications tasks on GPU. As far as we know, this is the first study to understand the power and energy consumption of Legion programming and runtime infrastructure. Our findings will enable HPC system designers and operators to develop and tune the performance of data-centric HPC applications with constraints on power and energy consumption
- …