757 research outputs found
Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach
While it is well-known and acknowledged that the performance of graph
algorithms is heavily dependent on the input data, there has been surprisingly
little research to quantify and predict the impact the graph structure has on
performance. Parallel graph algorithms, running on many-core systems such as
GPUs, are no exception: most research has focused on how to efficiently
implement and tune different graph operations on a specific GPU. However, the
performance impact of the input graph has only been taken into account
indirectly as a result of the graphs used to benchmark the system.
In this work, we present a case study investigating how to use the properties
of the input graph to improve the performance of the breadth-first search (BFS)
graph traversal. To do so, we first study the performance variation of 15
different BFS implementations across 248 graphs. Using this performance data,
we show that significant speed-up can be achieved by combining the best
implementation for each level of the traversal. To make use of this
data-dependent optimization, we must correctly predict the relative performance
of algorithms per graph level, and enable dynamic switching to the optimal
algorithm for each level at runtime.
We use the collected performance data to train a binary decision tree, to
enable high-accuracy predictions and fast switching. We demonstrate empirically
that our decision tree is both fast enough to allow dynamic switching between
implementations, without noticeable overhead, and accurate enough in its
prediction to enable significant BFS speedup. We conclude that our model-driven
approach (1) enables BFS to outperform state of the art GPU algorithms, and (2)
can be adapted for other BFS variants, other algorithms, or more specific
datasets
Analysing the Performance of GPU Hash Tables for State Space Exploration
In the past few years, General Purpose Graphics Processors (GPUs) have been
used to significantly speed up numerous applications. One of the areas in which
GPUs have recently led to a significant speed-up is model checking. In model
checking, state spaces, i.e., large directed graphs, are explored to verify
whether models satisfy desirable properties. GPUexplore is a GPU-based model
checker that uses a hash table to efficiently keep track of already explored
states. As a large number of states is discovered and stored during such an
exploration, the hash table should be able to quickly handle many inserts and
queries concurrently. In this paper, we experimentally compare two different
hash tables optimised for the GPU, one being the GPUexplore hash table, and the
other using Cuckoo hashing. We compare the performance of both hash tables
using random and non-random data obtained from model checking experiments, to
analyse the applicability of the two hash tables for state space exploration.
We conclude that Cuckoo hashing is three times faster than GPUexplore hashing
for random data, and that Cuckoo hashing is five to nine times faster for
non-random data. This suggests great potential to further speed up GPUexplore
in the near future.Comment: In Proceedings GaM 2017, arXiv:1712.0834
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called
"Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflops/s in mixed precision. In this paper, we investigate
current approaches to program NVIDIA Tensor Cores, their performances and the
precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS
GEMM. After experimenting with different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100
GPU, seven and three times the performance in single and half precision
respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflops/s. While precision loss due to matrix multiplication with half precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using of
NVIDIA Tensor Cores.Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix
multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the
popular compressed-sparse-row (CSR) format and thus do not require expensive
format conversion. While previous SpMM work concentrates on thread-level
parallelism, we additionally focus on latency hiding with instruction-level
parallelism and load-balancing. We show, both theoretically and experimentally,
that the proposed SpMM is a better fit for the GPU than previous approaches. We
identify a key memory access pattern that allows efficient access into both
input and output matrices that is crucial to getting excellent performance on
SpMM. By combining these two ingredients---(i) merge-based load-balancing and
(ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and
a 31.7% geomean speedup over state-of-the-art SpMM implementations on
real-world datasets.Comment: 16 pages, 7 figures, International European Conference on Parallel
and Distributed Computing (Euro-Par) 201
A Unified Optimization Approach for Sparse Tensor Operations on GPUs
Sparse tensors appear in many large-scale applications with multidimensional
and sparse data. While multidimensional sparse data often need to be processed
on manycore processors, attempts to develop highly-optimized GPU-based
implementations of sparse tensor operations are rare. The irregular computation
patterns and sparsity structures as well as the large memory footprints of
sparse tensor operations make such implementations challenging. We leverage the
fact that sparse tensor operations share similar computation patterns to
propose a unified tensor representation called F-COO. Combined with
GPU-specific optimizations, F-COO provides highly-optimized implementations of
sparse tensor computations on GPUs. The performance of the proposed unified
approach is demonstrated for tensor-based kernels such as the Sparse Matricized
Tensor- Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor- Times-Matrix
Multiply (SpTTM) and is used in tensor decomposition algorithms. Compared to
state-of-the-art work we improve the performance of SpTTM and SpMTTKRP up to
3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement a
CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using
the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs
- …