Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format
Multiplication of a sparse matrix with a dense matrix (SpDM) is widely used in
many areas such as scientific computing and machine learning. However, existing
work overlooks the performance optimization of SpDM on modern many-core
architectures such as GPUs. Sparse storage data structures allow sparse matrices
to be stored in a memory-saving format, but their irregular data access patterns
make it difficult to optimize SpDM performance on modern GPUs, resulting in
lower resource utilization and poorer performance.
In this paper, we draw on the roofline performance model of GPUs to design an
efficient SpDM algorithm called GCOOSpDM, in which we exploit coalesced global
memory access, fast shared memory reuse, and more operations per byte of global
memory traffic. Experiments are conducted on three Nvidia GPUs (GTX 980,
GTX Titan X Pascal, and Tesla P100) with CUDA-8.0, using a large number of
matrices including a public dataset and randomly generated matrices.
Experimental results show that GCOOSpDM achieves a 1.5-8x speedup over
Nvidia's cuSPARSE library on many matrices. We also analyze instruction-level
operations on a particular GPU to understand the performance gap between
GCOOSpDM and cuSPARSE. The profiled instructions confirm that cuSPARSE spends a
lot of time on slow memory access (including DRAM access and L2 cache access),
while GCOOSpDM shifts those slow memory accesses to faster shared memory, which
is the main source of the performance gain. Results also show that GCOOSpDM
begins to outperform the dense algorithm (cuBLAS) at a lower sparsity than
cuSPARSE does on GPUs.

Comment: 11 pages
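The abstract does not spell out the GCOO layout itself, so as a point of reference, here is a minimal Python/SciPy sketch of the SpDM computation GCOOSpDM optimizes: every nonzero a_ij of the sparse matrix scales row j of the dense matrix and accumulates into row i of the output.

```python
# Minimal CPU reference for SpDM (sparse A times dense B). This sketches the
# computation GCOOSpDM accelerates on GPUs, not the GCOO format itself.
import numpy as np
from scipy.sparse import random as sparse_random

m, k, n = 512, 512, 64
A = sparse_random(m, k, density=0.01, format="coo", dtype=np.float64)
B = np.random.rand(k, n)

# Naive COO traversal: every nonzero a_ij contributes a_ij * B[j, :] to C[i, :].
C = np.zeros((m, n))
for i, j, v in zip(A.row, A.col, A.data):
    C[i] += v * B[j]

assert np.allclose(C, A @ B)  # check against SciPy's built-in kernel
```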
Structured Deep Neural Network Pruning via Matrix Pivoting
Deep Neural Networks (DNNs) are key to state-of-the-art machine vision, sensor
fusion, and audio/video signal processing. Unfortunately, their computational
complexity and the tight resource constraints on the Edge make them hard to
deploy on mobile, embedded, and IoT devices. Owing to the great diversity of
Edge devices, DNN designers have to take the hardware platform and application
requirements into account during network training. In this work we introduce
pruning via matrix pivoting as a way to improve network pruning by balancing
the design flexibility of architecture-oblivious pruning against the
performance efficiency of architecture-aware pruning, the two dominant
techniques for obtaining resource-efficient DNNs. We also describe local and global network
optimization techniques for efficient implementation of the resulting pruned
networks. In combination, the proposed pruning and implementation yield a
speed-up that scales nearly linearly with the reduction of network coefficients
during pruning.

Comment: 16 pages, 3 figures, 2 tables, 1 listing
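The pivoting strategy is not given in the abstract; the following hypothetical sketch only illustrates the general idea of permuting a weight matrix so that a structured mask removes a contiguous block of weak rows, then undoing the permutation (the L1-norm criterion and 50% ratio are illustrative assumptions).

```python
# Hypothetical sketch of pruning via matrix pivoting: reorder the rows of a
# weight matrix by importance so that a block-structured mask prunes the
# weakest rows together as one dense block, then undo the permutation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

# Pivot: permutation that sorts rows by descending importance (L1 norm).
perm = np.argsort(-np.abs(W).sum(axis=1))
W_piv = W[perm]

# Structured prune: zero out the bottom half of rows as one contiguous block.
W_piv[4:, :] = 0.0

# Undo the pivot so the layer computes in its original coordinate system.
inv = np.empty_like(perm)
inv[perm] = np.arange(len(perm))
W_pruned = W_piv[inv]
```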
Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments
Architectures with multiple classes of memory media are becoming a common
part of mainstream supercomputer deployments. So-called multi-level memories
offer differing characteristics for each memory component, including variation
in bandwidth, latency, and capacity. This paper investigates the performance of
sparse matrix multiplication kernels on two leading high-performance computing
architectures -- Intel's Knights Landing processor and NVIDIA's Pascal GPU. We
describe a data placement method and a chunking-based algorithm for our kernels
that exploits the existence of the multiple memory spaces in each hardware
platform. We evaluate the performance of these methods against standard
algorithms that rely on the hardware's auto-caching mechanisms. Our results show that standard
algorithms that exploit cache reuse performed as well as multi-memory-aware
algorithms for architectures such as KNLs where the memory subsystems have
similar latencies. However, for architectures such as GPUs where memory
subsystems differ significantly in both bandwidth and latency,
multi-memory-aware methods are crucial for good performance. In addition, our
new approaches permit the user to run problems that require larger capacities
than the fastest memory of each compute node, without depending on
software-managed cache mechanisms.
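As a rough sketch of the chunking idea (the chunk size and explicit staging step are illustrative assumptions, not the paper's algorithm): process row blocks of the first matrix sized to fit a fast-memory budget, multiply each against the resident second matrix, and concatenate the partial results.

```python
# Sketch of a chunking-based sparse matrix-matrix multiply: row blocks of A
# are "staged" into a hypothetical fast memory tier and multiplied against a
# resident B, emulating explicit data placement on multi-level memories.
import numpy as np
from scipy.sparse import random as sparse_random, vstack

n = 1024
A = sparse_random(n, n, density=0.01, format="csr")
B = sparse_random(n, n, density=0.01, format="csr")

chunk_rows = 128  # rows of A whose working set fits the fast tier (assumed)
parts = []
for start in range(0, n, chunk_rows):
    A_chunk = A[start:start + chunk_rows]  # stage the chunk in fast memory
    parts.append(A_chunk @ B)              # multiply against resident B
C = vstack(parts)

assert abs(C - A @ B).sum() < 1e-9  # matches the unchunked product
```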
A Comparative Study on Exact Triangle Counting Algorithms on the GPU
We implement exact triangle counting in graphs on the GPU using three
different methodologies: subgraph matching to a triangle pattern; programmable
graph analytics, with a set-intersection approach; and a matrix formulation
based on sparse matrix-matrix multiplies. All three deliver best-of-class
performance over CPU implementations and over comparable GPU implementations,
with the graph-analytic approach achieving the best performance due to its
ability to exploit efficient filtering steps to remove unnecessary work and its
high-performance set-intersection core.

Comment: 7 pages, 6 figures, and 2 tables
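The matrix formulation is standard: for a simple undirected graph with adjacency matrix A, each triangle is counted six times in trace(A^3), or equivalently in the sum of the entrywise product of A with A*A. A minimal SciPy instance:

```python
# Exact triangle counting via sparse matrix-matrix multiplication:
# triangles = sum(A .* (A @ A)) / 6 for a simple undirected graph.
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]  # one triangle: {0, 1, 2}
rows, cols = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(4, 4))
A = A + A.T  # symmetrize the adjacency matrix

triangles = int(A.multiply(A @ A).sum() // 6)
print(triangles)  # -> 1
```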
Parallel Triangular Solvers on GPU
In this paper, we systematically investigate GPU-based parallel triangular
solvers. Parallel triangular solvers are fundamental to the incomplete LU
factorization family of preconditioners and to algebraic multigrid solvers. We
develop a new matrix format suitable for GPU devices. Parallel lower triangular
solvers and upper triangular solvers are developed for this new data structure.
With these solvers, ILU preconditioners and domain decomposition
preconditioners are developed. Numerical results show that our approach speeds
up triangular solvers by a factor of around seven.
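The paper's custom matrix format is not described in the abstract; a common way to expose the parallelism such solvers need is level scheduling, sketched below: rows are grouped into levels with no mutual dependencies, so all rows within a level can be solved simultaneously.

```python
# Sketch of a level-scheduled sparse lower-triangular solve. Rows in the same
# level are mutually independent; a GPU would process each level in parallel,
# while this reference version loops over them sequentially.
import numpy as np
from scipy.sparse import csr_matrix

L = csr_matrix(np.array([[2.0, 0.0, 0.0, 0.0],
                         [1.0, 3.0, 0.0, 0.0],
                         [0.0, 0.0, 4.0, 0.0],
                         [1.0, 0.0, 2.0, 5.0]]))
b = np.array([2.0, 5.0, 8.0, 12.0])

# Level of row i = 1 + max level over columns j < i with L[i, j] != 0.
level = np.zeros(L.shape[0], dtype=int)
for i in range(L.shape[0]):
    deps = L.indices[L.indptr[i]:L.indptr[i + 1]]
    deps = deps[deps < i]
    level[i] = 1 + level[deps].max() if deps.size else 0

x = np.zeros_like(b)
for lvl in range(level.max() + 1):
    for i in np.where(level == lvl)[0]:  # independent rows: parallel on a GPU
        cols = L.indices[L.indptr[i]:L.indptr[i + 1]]
        vals = L.data[L.indptr[i]:L.indptr[i + 1]]
        off = cols < i
        x[i] = (b[i] - vals[off] @ x[cols[off]]) / L[i, i]

assert np.allclose(L @ x, b)
```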
Sparse Matrix Multiplication On An Associative Processor
Sparse matrix multiplication is an important component of linear algebra
computations. Implementing sparse matrix multiplication on an associative
processor (AP) enables a high level of parallelism, where a row of one matrix is
multiplied in parallel with the entire second matrix, and where the execution
time of a vector dot product does not depend on the vector size. Four sparse
matrix multiplication algorithms are explored in this paper, combining AP and
baseline CPU processing to various levels. They are evaluated by simulation on
a large set of sparse matrices. The computational complexity of sparse matrix
multiplication on an AP is shown to be O(nnz), where nnz is the number of
nonzero elements. The AP is found to be especially efficient in binary sparse
matrix multiplication. The AP outperforms conventional solutions in power
efficiency.
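A sketch of the row-by-row (Gustavson-style) formulation that underlies the O(nnz) result: the work is one step per nonzero of the first matrix, and each step broadcasts that nonzero against an entire row of the second matrix. On an AP the broadcast is a single content-addressed parallel operation; here it is emulated with a vectorized NumPy update.

```python
# Row-by-row sparse multiply with O(nnz(A)) outer steps. The inner update,
# vectorized here, stands in for the AP's single-cycle parallel broadcast.
import numpy as np
from scipy.sparse import random as sparse_random

n = 256
A = sparse_random(n, n, density=0.02, format="csr")
B = sparse_random(n, n, density=0.02, format="csr").toarray()

C = np.zeros((n, n))
for i in range(n):
    for idx in range(A.indptr[i], A.indptr[i + 1]):  # one step per nonzero
        C[i] += A.data[idx] * B[A.indices[idx]]      # parallel across columns

assert np.allclose(C, A @ B)
```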
StructADMM: A Systematic, High-Efficiency Framework of Structured Weight Pruning for DNNs
Weight pruning methods of DNNs have been demonstrated to achieve a good model
pruning rate without loss of accuracy, thereby alleviating the significant
computation/storage requirements of large-scale DNNs. Structured weight pruning
methods have been proposed to overcome the limitation of irregular network
structure and demonstrated actual GPU acceleration. However, in prior work the
pruning rate (degree of sparsity) and GPU acceleration are limited (to less
than 50%) when accuracy needs to be maintained. In this work, we overcome these
limitations by proposing a unified, systematic framework of structured weight
pruning for DNNs. It is a framework that can be used to induce different types
of structured sparsity, such as filter-wise, channel-wise, and shape-wise
sparsity, as well as non-structured sparsity. The proposed framework incorporates
stochastic gradient descent with ADMM, and can be understood as a dynamic
regularization method in which the regularization target is analytically
updated in each iteration. Without loss of accuracy on the AlexNet model, we
achieve 2.58X and 3.65X average measured speedup on two GPUs, clearly
outperforming the prior work. The average speedups reach 3.15X and 8.52X when
allowing a moderate accuracy loss of 2%. In this case the model compression
for convolutional layers is 15.0X, corresponding to an 11.93X measured CPU
speedup. Our experiments on the ResNet model and on other datasets such as
UCF101 and CIFAR-10 demonstrate the consistently higher performance of our
framework.
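A minimal sketch of the ADMM decomposition described above, with an illustrative least-squares loss: a gradient step on W that includes a quadratic pull toward Z - U (the analytically updated regularization target), a projection of W + U onto the filter-wise sparsity set, and a dual update. The loss, rho, learning rate, and sparsity level are assumptions for illustration, not the paper's settings.

```python
# ADMM-style structured pruning on a toy least-squares problem: keep the top-k
# rows ("filters") of W by norm. Z carries the structured-sparse target and U
# the scaled dual variable.
import numpy as np

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(128, 16)), rng.normal(size=(128, 4))
W = rng.normal(size=(16, 4)) * 0.1
Z, U = W.copy(), np.zeros_like(W)
rho, lr, k = 1e-2, 1e-2, 8  # keep 8 of 16 rows (illustrative)

def project_topk_rows(M, k):
    """Euclidean projection onto matrices with at most k nonzero rows."""
    out = np.zeros_like(M)
    keep = np.argsort(-np.linalg.norm(M, axis=1))[:k]
    out[keep] = M[keep]
    return out

for step in range(500):
    grad = X.T @ (X @ W - Y) / len(X)  # gradient of the data loss
    grad += rho * (W - Z + U)          # dynamic regularization toward Z - U
    W -= lr * grad
    if step % 50 == 49:                # periodic Z and dual updates
        Z = project_topk_rows(W + U, k)
        U += W - Z

W = project_topk_rows(W, k)  # final hard prune to the learned structure
```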
Load-Balanced Sparse MTTKRP on GPUs
Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most
computationally expensive kernels in sparse tensor computations. This work
focuses on optimizing the MTTKRP operation on GPUs, addressing both performance
and storage requirements. We begin by identifying the performance bottlenecks
in directly extending the state-of-the-art CSF (compressed sparse fiber) format
from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs
is that of utilizing the much greater degree of parallelism in a load-balanced
fashion for irregular computations like sparse MTTKRP. To address this issue,
we develop a new storage-efficient representation for tensors that enables
high-performance, load-balanced execution of MTTKRP on GPUs. A GPU
implementation of sparse MTTKRP using the new sparse tensor representation is
shown to outperform all currently known parallel sparse CPU and GPU MTTKRP
implementations.
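For reference, the mode-0 MTTKRP of a third-order sparse tensor computes M[i, :] += x_ijk * (B[j, :] * C[k, :]) over all nonzeros. The paper's load-balanced tensor representation is not specified in the abstract; the sketch below shows the operation that representation accelerates, on a COO tensor.

```python
# Reference COO formulation of sparse MTTKRP for a 3-way tensor and CP rank R.
import numpy as np

I, J, K, R = 50, 40, 30, 8
nnz = 500
rng = np.random.default_rng(2)
i, j, k = (rng.integers(0, d, nnz) for d in (I, J, K))   # nonzero coordinates
vals = rng.normal(size=nnz)                              # nonzero values
B, C = rng.normal(size=(J, R)), rng.normal(size=(K, R))  # factor matrices

M = np.zeros((I, R))
np.add.at(M, i, vals[:, None] * B[j] * C[k])  # scatter-add over mode-0 index
```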
Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression
Hierarchical matrices are space and time efficient representations of dense
matrices that exploit the low rank structure of matrix blocks at different
levels of granularity. The hierarchically low rank block partitioning produces
representations that can be stored and operated on in near-linear complexity
instead of the usual polynomial complexity of dense matrices. In this paper, we
present high-performance implementations of matrix-vector multiplication and
compression operations for the H^2 variant of hierarchical matrices
on GPUs. This variant exploits, in addition to the hierarchical block
partitioning, hierarchical bases for the block representations, and results in a
scheme that requires only O(n) storage and O(n) complexity for the mat-vec
and compression kernels. These two operations are at the core of algebraic
operations for hierarchical matrices, the mat-vec being a ubiquitous operation
in numerical algorithms while compression/recompression represents a key
building block for other algebraic operations, which require periodic
recompression during execution. The difficulties in developing efficient GPU
algorithms come primarily from the irregular tree data structures that underlie
the hierarchical representations, and the key to performance is to recast the
computations on flattened trees in ways that allow batched linear algebra
operations to be performed. This requires marshaling the irregularly laid out
data in a way that allows them to be used by the batched routines. Marshaling
operations only involve pointer arithmetic with no data movement and as a
result have minimal overhead. Our numerical results on covariance matrices from
2D and 3D problems from spatial statistics show the high efficiency our
routines achieve: over 550 GB/s for the bandwidth-limited mat-vec and over
850 GFLOP/s of sustained performance for the compression on the P100 Pascal
GPU.
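A much simplified sketch of the flatten-and-batch idea, under assumptions real hierarchical-matrix trees do not guarantee (all blocks at a level share one size and rank): marshal block data into stacked arrays using index arithmetic only, then run one batched low-rank matvec over every block at once; the einsum stands in for a batched GEMV routine.

```python
# Batched low-rank matvec over marshaled blocks: y_b += U_b @ (V_b^T @ x_b)
# for all blocks in one call. Uniform block size/rank is a simplification.
import numpy as np

rng = np.random.default_rng(3)
nb, m, r, n = 4, 16, 3, 64             # blocks, block size, rank, dimension
U = rng.normal(size=(nb, m, r))        # stacked column bases
V = rng.normal(size=(nb, m, r))        # stacked row bases
row_off = col_off = np.arange(nb) * m  # where each block acts in y and x

x = rng.normal(size=n)
y = np.zeros(n)

# Marshaling: gather the x segment of every block (index arithmetic only).
xg = x[col_off[:, None] + np.arange(m)]            # shape (nb, m)

# One batched operation replaces nb small matvecs.
yg = np.einsum("bmr,bnr,bn->bm", U, V, xg)
np.add.at(y, row_off[:, None] + np.arange(m), yg)  # scatter results back
```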
Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
Sparse matrix-vector multiplication (spMVM) is the dominant operation in many
sparse solvers. We investigate performance properties of spMVM with matrices of
various sparsity patterns on the nVidia "Fermi" class of GPGPUs. A new "padded
jagged diagonals storage" (pJDS) format is proposed which may substantially
reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our
test scenarios the pJDS format cuts the overall spMVM memory footprint on the
GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance.
Using a suitable performance model, we identify node-level performance
bottlenecks that make some types of matrix structures unsuitable for efficient
multi-GPGPU parallelization. For appropriate sparsity patterns we extend
previous work on distributed-memory parallel spMVM to demonstrate a scalable
hybrid MPI-GPGPU code, achieving efficient overlap of communication and
computation.

Comment: 10 pages, 5 figures. Added reference to other recent sparse matrix formats
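A sketch of constructing a pJDS-like layout from CSR under simplified assumptions: sort rows by descending nonzero count, then pad each block of consecutive rows only to that block's maximum length, rather than padding every row to the global maximum as ELLPACK-R does; a block height of 32 mirrors a GPU warp.

```python
# Build a padded-jagged-diagonals-style layout: per-block padding instead of
# global padding. Details (permutation storage, column ordering) are omitted.
import numpy as np
from scipy.sparse import random as sparse_random

A = sparse_random(256, 256, density=0.05, format="csr")
row_len = np.diff(A.indptr)
perm = np.argsort(-row_len)  # longest rows first
bh = 32                      # rows per padded block (warp size, assumed)

blocks = []
for start in range(0, A.shape[0], bh):
    rows = perm[start:start + bh]
    width = row_len[rows].max()          # pad only to the block-local maximum
    vals = np.zeros((len(rows), width))
    cols = np.zeros((len(rows), width), dtype=int)
    for r, i in enumerate(rows):
        s, e = A.indptr[i], A.indptr[i + 1]
        vals[r, :e - s] = A.data[s:e]
        cols[r, :e - s] = A.indices[s:e]
    blocks.append((vals, cols))
```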