Load-Balanced Sparse MTTKRP on GPUs
Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most
computationally expensive kernels in sparse tensor computations. This work
focuses on optimizing the MTTKRP operation on GPUs, addressing both performance
and storage requirements. We begin by identifying the performance bottlenecks
in directly extending the state-of-the-art CSF (compressed sparse fiber) format
from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs
is that of utilizing the much greater degree of parallelism in a load-balanced
fashion for irregular computations like sparse MTTKRP. To address this issue,
we develop a new storage-efficient representation for tensors that enables
high-performance, load-balanced execution of MTTKRP on GPUs. A GPU
implementation of sparse MTTKRP using the new sparse tensor representation is
shown to outperform all currently known parallel sparse CPU and GPU MTTKRP
implementations.
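For reference, the kernel being optimized here can be stated in a few lines. The following is a minimal, unoptimized COO-based sparse MTTKRP in Python/NumPy; the function and variable names are illustrative and not from the paper's GPU implementation:

```python
import numpy as np

def mttkrp_coo(inds, vals, factors, mode, dims, rank):
    """Mode-`mode` MTTKRP for a sparse tensor in COO form.

    For a 3-way tensor and mode 0, each nonzero x_ijk contributes
    out[i, :] += x_ijk * (B[j, :] * C[k, :]).
    """
    out = np.zeros((dims[mode], rank))
    for idx, v in zip(inds, vals):
        row = np.full(rank, v)
        for m, i in enumerate(idx):
            if m != mode:               # skip the output mode's own factor
                row *= factors[m][i]    # elementwise Khatri-Rao row product
        out[idx[mode]] += row           # scatter into the output row
    return out
```

On a GPU these scattered row updates are exactly where load imbalance and write contention arise, which is what the representation above is designed to address.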
Shared Memory Parallelization of MTTKRP for Dense Tensors
The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational
bottleneck for algorithms computing CP decompositions of tensors. In this
paper, we develop shared-memory parallel algorithms for MTTKRP involving dense
tensors. The algorithms cast nearly all of the computation as matrix operations
in order to use optimized BLAS subroutines, and they avoid reordering tensor
entries in memory. We benchmark sequential and parallel performance of our
implementations, demonstrating high sequential performance and efficient
parallel scaling. We use our parallel implementation to compute a CP
decomposition of a neuroimaging data set and achieve a speedup over existing
parallel software.
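The idea of casting MTTKRP as matrix operations can be illustrated concretely: in C-order memory layout, the mode-0 unfolding is a free reshape, so the entire MTTKRP becomes one BLAS-backed GEMM with the Khatri-Rao product of the remaining factors. This is a sketch of the general technique, not the paper's implementation:

```python
import numpy as np

def khatri_rao(B, C):
    # Column-wise Kronecker product: row (j*K + k) equals B[j, :] * C[k, :].
    J, R = B.shape
    K, _ = C.shape
    return np.einsum('jr,kr->jkr', B, C).reshape(J * K, R)

def mttkrp_dense_mode0(X, B, C):
    # Mode-0 unfolding of a C-ordered array is a free reshape, so the
    # whole MTTKRP is a single GEMM (dispatched to BLAS by NumPy).
    I = X.shape[0]
    return X.reshape(I, -1) @ khatri_rao(B, C)
```

For other modes the unfolding is no longer contiguous, which is why avoiding explicit reordering of tensor entries (as the paper does) takes more care.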
Software for Sparse Tensor Decomposition on Emerging Computing Architectures
In this paper, we develop software for decomposing sparse tensors that is
portable to and performant on a variety of multicore, manycore, and GPU
computing architectures. The result is a single code whose performance matches
optimized architecture-specific implementations. The key to a portable approach
is to determine multiple levels of parallelism that can be mapped in different
ways to different architectures, and we explain how to do this for the
matricized tensor times Khatri-Rao product (MTTKRP) which is the key kernel in
canonical polyadic tensor decomposition. Our implementation leverages the
Kokkos framework, which enables a single code to achieve high performance
across multiple architectures that differ in how they approach fine-grained
parallelism. We also introduce a new construct for portable thread-local
arrays, which we call compile-time polymorphic arrays. Not only are the
specifics of our approaches and implementation interesting for tuning tensor
computations, but they also provide a roadmap for developing other portable
high-performance codes. As a last step in optimizing performance, we modify the
MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce
atomic-write contention. We test the performance of our implementation on 16-
and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are
competitive with state-of-the-art architecture-specific codes while having the
advantage of being able to run on a variety of architectures.
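The final optimization mentioned above, permuting the traversal of nonzeros to reduce atomic-write contention, can be sketched in NumPy: sorting nonzeros by the output-mode index turns many conflicting scatter-adds into one segmented reduction per output row. This is an illustrative analogue only; the paper's GPU code and its permutation scheme differ:

```python
import numpy as np

def mttkrp_segmented(inds, vals, B, C, I, R):
    """Mode-0 MTTKRP over COO nonzeros with a permuted (sorted) traversal.

    Sorting by the output-mode index groups all updates to the same
    output row together, so they collapse into one segmented reduction
    instead of many scattered (atomic, on a GPU) writes.
    """
    order = np.argsort(inds[:, 0], kind='stable')   # permute nonzeros
    inds, vals = inds[order], vals[order]
    rows = vals[:, None] * B[inds[:, 1]] * C[inds[:, 2]]
    uniq, starts = np.unique(inds[:, 0], return_index=True)
    out = np.zeros((I, R))
    out[uniq] = np.add.reduceat(rows, starts, axis=0)  # one write per row
    return out
```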
Parallel Nonnegative CP Decomposition of Dense Tensors
The CP tensor decomposition is a low-rank approximation of a tensor. We
present a distributed-memory parallel algorithm and implementation of an
alternating optimization method for computing a CP decomposition of dense
tensor data that can enforce nonnegativity of the computed low-rank factors.
The principal task is to parallelize the matricized-tensor times Khatri-Rao
product (MTTKRP) bottleneck subcomputation. The algorithm is computation
efficient, using dimension trees to avoid redundant computation across MTTKRPs
within the alternating method. Our approach is also communication efficient,
using a data distribution and parallel algorithm across a multidimensional
processor grid that can be tuned to minimize communication. We benchmark our
software on synthetic as well as hyperspectral image and neuroscience dynamic
functional connectivity data, demonstrating that our algorithm scales well to
100s of nodes (up to 4096 cores) and is faster and more general than the
currently available parallel software.
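The dimension-tree idea referenced above can be illustrated for a 3-way dense tensor: a partial contraction with one factor is computed once and then shared between two MTTKRPs, avoiding the redundant work of two independent full MTTKRPs. This is a simplified single-node sketch, not the paper's distributed implementation:

```python
import numpy as np

def mttkrp_modes01_via_tree(X, A, B, C):
    """Dimension-tree style reuse for a 3-way dense tensor.

    The partial contraction T = X x_3 C is computed once and shared by
    the mode-0 and mode-1 MTTKRPs within one alternating-method sweep.
    """
    T = np.einsum('ijk,kr->ijr', X, C)   # shared partial result
    M0 = np.einsum('ijr,jr->ir', T, B)   # mode-0 MTTKRP
    M1 = np.einsum('ijr,ir->jr', T, A)   # mode-1 MTTKRP
    return M0, M1
```

The dominant cost (the contraction with the full tensor) is paid once instead of twice, which is the source of the computational savings the abstract describes.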
PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite
Tensor methods have gained increasing attention from various applications,
including machine learning, quantum chemistry, healthcare analytics, social
network analysis, data mining, and signal processing, to name a few. Sparse
tensors and their algorithms have become critical to further improving the
performance of these methods and enhancing the interpretability of their
output. This work presents a sparse tensor algorithm benchmark suite (PASTA)
for single- and multi-core CPUs. To the best of our knowledge, this is the
first benchmark suite for the sparse tensor world. PASTA aims at: 1) helping
application users evaluate different computer systems using its representative
computational workloads; and 2) providing insights for better utilizing
existing computer architectures and systems, as well as inspiration for future
designs. The benchmark suite is publicly released at
https://gitlab.com/tensorworld/pasta
A model-driven approach for a new generation of adaptive libraries
Efficient high-performance libraries often expose multiple tunable parameters
to provide highly optimized routines. These can range from simple loop unroll
factors or vector sizes all the way to algorithmic changes, given that some
implementations can be more suitable for certain devices by exploiting hardware
characteristics such as local memories and vector units. Traditionally, such
parameters and algorithmic choices are tuned and then hard-coded for a specific
architecture and for certain characteristics of the inputs. However, emerging
applications are often data-driven, thus traditional approaches are not
effective across the wide range of inputs and architectures used in practice.
In this paper, we present a new adaptive framework for data-driven applications
which uses a predictive model to select the optimal algorithmic parameters by
training with synthetic and real datasets. We demonstrate the effectiveness of
our approach on a BLAS library, specifically on its matrix multiplication
routine. We
present experimental results for two GPU architectures and show significant
performance gains of up to 3x (on a high-end NVIDIA Pascal GPU) and 2.5x (on an
embedded ARM Mali GPU) when compared to a traditionally optimized library.
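The framework's core mechanism, predicting a good tunable parameter from input features rather than hard-coding it per architecture, can be sketched with a toy predictive model. The 1-nearest-neighbour model, the shape features, and the tile-size parameter below are illustrative stand-ins, not the paper's actual model or tuning space:

```python
import numpy as np

def train_selector(samples):
    """Fit a minimal predictive model (1-nearest-neighbour on log2 sizes)
    mapping input-shape features to the best parameter value measured
    during offline benchmarking.

    `samples` is a list of (features, best_param) pairs, e.g.
    ((rows, cols), tile_size), gathered from synthetic and real inputs.
    """
    feats = np.log2(np.array([f for f, _ in samples], dtype=float))
    params = [p for _, p in samples]

    def predict(feature):
        d = np.linalg.norm(feats - np.log2(np.array(feature, dtype=float)),
                           axis=1)
        return params[int(np.argmin(d))]

    return predict
```

At run time the library would call `predict` on each input's features and dispatch to the corresponding implementation, instead of using one hard-coded choice.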
Stochastic Gradients for Large-Scale Tensor Decomposition
Tensor decomposition is a well-known tool for multiway data analysis. This
work proposes using stochastic gradients for efficient generalized canonical
polyadic (GCP) tensor decomposition of large-scale tensors. GCP tensor
decomposition is a recently proposed version of tensor decomposition that
allows for a variety of loss functions such as Bernoulli loss for binary data
or Huber loss for robust estimation. The stochastic gradient is formed from
randomly sampled elements of the tensor and is efficient because it can be
computed using the sparse matricized-tensor-times-Khatri-Rao product (MTTKRP)
tensor kernel. For dense tensors, we simply use uniform sampling. For sparse
tensors, we propose two types of stratified sampling that give precedence to
sampling nonzeros. Numerical results demonstrate the advantages of the proposed
approach and its scalability to large-scale problems.
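The sampled-gradient construction can be sketched for the dense, squared-loss case: each uniformly sampled entry contributes a residual-weighted Khatri-Rao row, so the gradient is a tiny sampled MTTKRP. This is an illustrative sketch only; the paper's GCP setting covers general losses and stratified sampling for sparse tensors:

```python
import numpy as np

def sgd_step_mode0(X, A, B, C, n_samples, lr, rng):
    """One stochastic-gradient step on factor A for the squared loss,
    using uniformly sampled entries of a dense 3-way tensor."""
    I, J, K = X.shape
    ii = rng.integers(0, I, n_samples)
    jj = rng.integers(0, J, n_samples)
    kk = rng.integers(0, K, n_samples)
    krp = B[jj] * C[kk]                          # sampled Khatri-Rao rows
    resid = np.sum(A[ii] * krp, axis=1) - X[ii, jj, kk]
    grad = np.zeros_like(A)
    np.add.at(grad, ii, resid[:, None] * krp)    # sampled MTTKRP
    return A - lr * (I * J * K / n_samples) * grad
```

The scaling factor makes the sampled gradient an unbiased estimate of the full gradient, which is what permits stochastic optimization of very large tensors.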
PLANC: Parallel Low Rank Approximation with Non-negativity Constraints
We consider the problem of low-rank approximation of massive dense
non-negative tensor data, for example to discover latent patterns in video and
imaging applications. As the size of data sets grows, single workstations are
hitting bottlenecks in both computation time and available memory. We propose a
distributed-memory parallel computing solution to handle massive data sets,
loading the input data across the memories of multiple nodes and performing
efficient and scalable parallel algorithms to compute the low-rank
approximation. We present a software package called PLANC (Parallel Low Rank
Approximation with Non-negativity Constraints), which implements our solution
and allows for extension in terms of data (dense or sparse, matrices or tensors
of any order), algorithm (e.g., from multiplicative updating techniques to
alternating direction method of multipliers), and architecture (we exploit GPUs
to accelerate the computation in this work). We describe our parallel
distributions and algorithms, which are careful to avoid unnecessary
communication and computation, show how to extend the software to include new
algorithms and/or constraints, and report efficiency and scalability results
for both synthetic and real-world data sets.
Tensor Completion Algorithms in Big Data Analytics
Tensor completion is a problem of filling the missing or unobserved entries
of partially observed tensors. Due to the multidimensional character of tensors
in describing complex datasets, tensor completion algorithms and their
applications have received wide attention and achievement in areas like data
mining, computer vision, signal processing, and neuroscience. In this survey,
we provide a modern overview of recent advances in tensor completion algorithms
from the perspective of big data analytics characterized by diverse variety,
large volume, and high velocity. We characterize these advances from four
perspectives: general tensor completion algorithms, tensor completion with
auxiliary information (variety), scalable tensor completion algorithms
(volume), and dynamic tensor completion algorithms (velocity). Further, we
identify several tensor completion applications on real-world data-driven
problems and present some common experimental frameworks popularized in the
literature. Our goal is to summarize these popular methods and introduce them
to researchers and practitioners for promoting future research and
applications. We conclude with a discussion of key challenges and promising
research directions in this community for future exploration.
A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs
Tensor computations present significant performance challenges that impact a
wide spectrum of applications ranging from machine learning, healthcare
analytics, social network analysis, data mining to quantum chemistry and signal
processing. Efforts to improve the performance of tensor computations include
exploring data layout, execution scheduling, and parallelism in common tensor
kernels. This work presents a benchmark suite for arbitrary-order sparse tensor
kernels using state-of-the-art tensor formats: coordinate (COO) and
hierarchical coordinate (HiCOO) on CPUs and GPUs. It presents a set of
reference tensor kernel implementations that are compatible with real-world
tensors and power law tensors extended from synthetic graph generation
techniques. We also propose Roofline performance models for these kernels to
provide insight into computer platforms from a sparse tensor perspective.
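A Roofline bound of the kind proposed here is simple to state: attainable throughput is the minimum of the compute peak and arithmetic intensity times memory bandwidth. In the sketch below, the per-nonzero flop and byte counts are an assumed rough cost model for a COO MTTKRP at rank R, and the peak/bandwidth figures are nominal P100-class numbers, not the paper's measurements:

```python
def roofline_bound(peak_gflops, bw_gbs, flops_per_nnz, bytes_per_nnz):
    """Attainable GFLOP/s under the Roofline model:
    min(compute peak, arithmetic intensity x memory bandwidth)."""
    ai = flops_per_nnz / bytes_per_nnz   # arithmetic intensity, flop/byte
    return min(peak_gflops, ai * bw_gbs)

# Assumed per-nonzero costs for COO MTTKRP at rank R: ~3R flops;
# three 4-byte indices, one 8-byte value, and ~3R 8-byte factor-row
# accesses (reads of two factor rows plus the output update).
R = 16
flops = 3 * R                       # 48 flops per nonzero
nbytes = 3 * 4 + 8 + 3 * R * 8      # 404 bytes per nonzero
bound = roofline_bound(4700.0, 732.0, flops, nbytes)
```

With an intensity of roughly 0.12 flop/byte, the kernel sits far left of the ridge point, confirming the memory-bound character that such models are meant to expose.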