Load-Balanced Sparse MTTKRP on GPUs
Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most
computationally expensive kernels in sparse tensor computations. This work
focuses on optimizing the MTTKRP operation on GPUs, addressing both performance
and storage requirements. We begin by identifying the performance bottlenecks
in directly extending the state-of-the-art CSF (compressed sparse fiber) format
from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs
is that of utilizing the much greater degree of parallelism in a load-balanced
fashion for irregular computations like sparse MTTKRP. To address this issue,
we develop a new storage-efficient representation for tensors that enables
high-performance, load-balanced execution of MTTKRP on GPUs. A GPU
implementation of sparse MTTKRP using the new sparse tensor representation is
shown to outperform all currently known parallel sparse CPU and GPU MTTKRP
implementations.
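As background, the operation itself is simple to state: for a mode-0 MTTKRP of a 3-way tensor, each nonzero X[i,j,k] contributes X[i,j,k] * (B[j,:] ∘ C[k,:]) to row i of the output. Below is a minimal, element-wise reference sketch over a COO tensor, not the CSF-derived GPU representation developed in the paper; the function name and COO layout are illustrative assumptions.

```python
import numpy as np

def mttkrp_coo(coords, vals, factors, mode):
    """Reference sparse MTTKRP over a COO tensor along `mode`.

    coords  : (nnz, N) array of nonzero indices
    vals    : (nnz,) array of nonzero values
    factors : list of N dense factor matrices, factors[n] of shape (dim_n, R)
    Returns a (dim_mode, R) matrix.
    """
    rank = factors[0].shape[1]
    out = np.zeros((factors[mode].shape[0], rank))
    other = [n for n in range(coords.shape[1]) if n != mode]
    for idx, v in zip(coords, vals):
        # Hadamard product of the factor rows selected by the other modes,
        # scaled by the nonzero value and accumulated into the output row.
        row = np.full(rank, v)
        for n in other:
            row = row * factors[n][idx[n]]
        out[idx[mode]] += row
    return out

# Tiny usage example: a 4 x 3 x 5 tensor with 2 nonzeros, rank 2.
coords = np.array([[0, 1, 2], [3, 0, 1]])
vals = np.array([1.5, -2.0])
factors = [np.random.rand(4, 2), np.random.rand(3, 2), np.random.rand(5, 2)]
M = mttkrp_coo(coords, vals, factors, mode=0)   # shape (4, 2)
```

The per-nonzero accumulation above is exactly what makes load balancing hard on GPUs: nonzeros (and the fibers grouping them) are distributed very unevenly across the output rows.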
On Optimizing Distributed Tucker Decomposition for Sparse Tensors
The Tucker decomposition generalizes the notion of Singular Value
Decomposition (SVD) to tensors, the higher-dimensional analogues of matrices.
We study the problem of constructing the Tucker decomposition of sparse tensors
on distributed memory systems via the HOOI procedure, a popular iterative
method. The scheme used for distributing the input tensor among the processors
(MPI ranks) critically influences the HOOI execution time. Prior work has
proposed different distribution schemes: an offline scheme based on a
sophisticated hypergraph partitioning method, and simple, lightweight
alternatives that can be used in real time. While the hypergraph-based scheme
typically results in faster HOOI execution, it is complex, and the time taken
to determine the distribution is an order of magnitude higher than the
execution time of a single HOOI iteration. Our main contribution is a
lightweight distribution scheme, which achieves the best of both worlds. We
show that the scheme is near-optimal on certain fundamental metrics associated
with the HOOI procedure and as a result, near-optimal on the computational load
(FLOPs). Though the scheme may incur higher communication volume, the
computation time is the dominant factor and, as a result, the scheme achieves
better overall HOOI execution time. Our experimental
evaluation on large real-life tensors (having up to 4 billion elements) shows
that the scheme outperforms the prior schemes on the HOOI execution time by a
factor of up to 3x. On the other hand, its distribution time is comparable to
the prior lightweight schemes and is typically less than the execution time
of a single HOOI iteration.
Comment: Abridged version of the paper to appear in the proceedings of ICS'1
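For context on the HOOI procedure that the distribution scheme targets, the following is a minimal single-node, dense HOOI sketch in NumPy; the paper's setting is sparse tensors distributed over MPI ranks, which this does not capture, and the function names and HOSVD-style initialization are assumptions for illustration.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hooi(T, ranks, n_iter=10):
    """Dense HOOI sketch: alternately refit each factor, then form the core.

    T     : dense N-way numpy array
    ranks : target Tucker ranks, one per mode (assumed small)
    Returns (core, factors).
    """
    N = T.ndim
    # HOSVD-style initialization: leading left singular vectors per unfolding.
    factors = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :ranks[n]]
               for n in range(N)]
    for _ in range(n_iter):
        for n in range(N):
            # Project T onto the factor spaces of all modes except n.
            Y = T
            for m in range(N):
                if m != n:
                    Y = np.tensordot(Y, factors[m], axes=(m, 0))
                    Y = np.moveaxis(Y, -1, m)
            # Update factor n from the leading singular vectors of the unfolding.
            U, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
            factors[n] = U[:, :ranks[n]]
    # Core tensor: project T onto all factor spaces.
    core = T
    for m in range(N):
        core = np.tensordot(core, factors[m], axes=(m, 0))
        core = np.moveaxis(core, -1, m)
    return core, factors
```

In the distributed sparse setting studied in the paper, the dominant cost of each iteration is the sparse tensor-times-matrix chain (the projection step above), which is why the distribution of the input nonzeros across ranks determines both the computational load and the communication volume.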