A Unified Optimization Approach for Sparse Tensor Operations on GPUs
Sparse tensors appear in many large-scale applications with multidimensional
and sparse data. While multidimensional sparse data often need to be processed
on manycore processors, attempts to develop highly-optimized GPU-based
implementations of sparse tensor operations are rare. The irregular computation
patterns and sparsity structures as well as the large memory footprints of
sparse tensor operations make such implementations challenging. We leverage the
fact that sparse tensor operations share similar computation patterns to
propose a unified tensor representation called F-COO. Combined with
GPU-specific optimizations, F-COO provides highly-optimized implementations of
sparse tensor computations on GPUs. We demonstrate the performance of the
proposed unified approach on tensor-based kernels such as the Sparse Matricized
Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix
Multiply (SpTTM), and use it in tensor decomposition algorithms. Compared to
state-of-the-art work, we improve the performance of SpTTM and SpMTTKRP by up
to 3.7 and 30.6 times, respectively, on NVIDIA Titan-X GPUs. We implement a
CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using
the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
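As a rough illustration of the kind of kernel the abstract above refers to, the
sketch below computes a mode-1 SpMTTKRP for a third-order tensor stored in plain
coordinate (COO) form. It is not the F-COO format or the GPU implementation from
the paper; the function name `spmttkrp_mode1`, the argument layout, and the toy
data are assumptions made only for this example.

```python
def spmttkrp_mode1(coords, values, B, C, num_rows):
    """Mode-1 MTTKRP on a COO tensor (illustrative CPU sketch, not F-COO).

    Computes M[i][r] = sum over nonzeros X[i, j, k] of X[i, j, k] * B[j][r] * C[k][r].
    coords: list of (i, j, k) index triples of the nonzeros (hypothetical layout)
    values: list of the corresponding nonzero values
    B, C:   dense factor matrices as lists of rows, each with `rank` columns
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    # Only stored nonzeros are visited, so the cost scales with nnz * rank.
    for (i, j, k), x in zip(coords, values):
        for r in range(rank):
            M[i][r] += x * B[j][r] * C[k][r]
    return M


# Toy 2 x 2 x 2 tensor with three nonzeros and rank-2 factor matrices.
coords = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]
values = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
M = spmttkrp_mode1(coords, values, B, C, num_rows=2)
print(M)  # [[47.0, 76.0], [21.0, 48.0]]
```

Several nonzeros can update the same output row, which is one source of the
irregular access patterns the abstract mentions; a parallel version would need
atomics or a segmented reduction at that point.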
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The task graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. We compare the performance of the PaStiX solver running on its
native internal scheduler, on PaRSEC, and on StarPU across different
execution environments. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.
Comment: Heterogeneity in Computing Workshop (2014
- …