Parallel Nonnegative CP Decomposition of Dense Tensors
The CP tensor decomposition is a low-rank approximation of a tensor. We
present a distributed-memory parallel algorithm and implementation of an
alternating optimization method for computing a CP decomposition of dense
tensor data that can enforce nonnegativity of the computed low-rank factors.
The principal task is to parallelize the matricized-tensor times Khatri-Rao
product (MTTKRP) bottleneck subcomputation. The algorithm is computationally
efficient, using dimension trees to avoid redundant computation across MTTKRPs
within the alternating method. Our approach is also communication-efficient,
using a data distribution and parallel algorithm across a multidimensional
processor grid that can be tuned to minimize communication. We benchmark our
software on synthetic as well as hyperspectral image and neuroscience dynamic
functional connectivity data, demonstrating that our algorithm scales well to
hundreds of nodes (up to 4096 cores) and is faster and more general than the
currently available parallel software.
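For readers unfamiliar with the MTTKRP kernel named above, the following minimal NumPy sketch shows the sequential computation for a 3-way tensor. It is illustrative only (the function name and einsum formulation are our own) and is not the paper's distributed, dimension-tree-based implementation.

```python
import numpy as np

def mttkrp(X, factors, mode):
    """Sequential MTTKRP for a 3-way tensor (illustrative sketch).

    X       : ndarray of shape (I, J, K)
    factors : [A, B, C] with shapes (I, R), (J, R), (K, R)
    mode    : which mode to compute the MTTKRP for (0, 1, or 2)

    For mode 0 this computes M[i, r] = sum_{j,k} X[i,j,k] * B[j,r] * C[k,r],
    i.e. the mode-0 unfolding of X times the Khatri-Rao product of the
    remaining factor matrices.
    """
    idx = 'ijk'
    # Build an einsum spec such as 'ijk,jr,kr->ir' for mode 0.
    inputs = [idx] + [idx[n] + 'r' for n in range(3) if n != mode]
    spec = ','.join(inputs) + '->' + idx[mode] + 'r'
    return np.einsum(spec, X, *(factors[n] for n in range(3) if n != mode))

# Tiny usage example with random data (shapes are arbitrary):
rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
X = rng.random((I, J, K))
A, B, C = (rng.random((d, R)) for d in (I, J, K))
M = mttkrp(X, [A, B, C], mode=0)   # result has shape (I, R)
```

In an alternating optimization method, this kernel is evaluated once per mode per sweep, which is why the papers above focus on parallelizing it and on reusing partial results across modes via dimension trees.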
PLANC: Parallel Low Rank Approximation with Non-negativity Constraints
We consider the problem of low-rank approximation of massive dense
non-negative tensor data, for example to discover latent patterns in video and
imaging applications. As the size of data sets grows, single workstations are
hitting bottlenecks in both computation time and available memory. We propose a
distributed-memory parallel computing solution to handle massive data sets,
loading the input data across the memories of multiple nodes and performing
efficient and scalable parallel algorithms to compute the low-rank
approximation. We present a software package called PLANC (Parallel Low Rank
Approximation with Non-negativity Constraints), which implements our solution
and allows for extension in terms of data (dense or sparse, matrices or tensors
of any order), algorithm (e.g., from multiplicative updating techniques to
alternating direction method of multipliers), and architecture (we exploit GPUs
to accelerate the computation in this work). We describe our parallel
distributions and algorithms, which are careful to avoid unnecessary
communication and computation, show how to extend the software to include new
algorithms and/or constraints, and report efficiency and scalability results
for both synthetic and real-world data sets.
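As one concrete instance of the "multiplicative updating techniques" the abstract mentions, here is a minimal sequential NumPy sketch of Lee-Seung multiplicative updates for nonnegative matrix factorization. It illustrates the update rule only; it is not PLANC's distributed C++ implementation, and the function name and parameters are our own.

```python
import numpy as np

def nmf_multiplicative(X, rank, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for X ≈ W @ H with W, H >= 0.

    Starting from random nonnegative W and H, each update multiplies the
    current factor elementwise by a nonnegative ratio, so nonnegativity
    is preserved automatically; eps guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H for fixed W
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W for fixed H
    return W, H

# Usage: factor a random nonnegative matrix at rank 5.
X = np.random.default_rng(1).random((100, 80))
W, H = nmf_multiplicative(X, rank=5)
```

Swapping this update rule for another (e.g., alternating direction method of multipliers) changes only the per-factor update step, which is the kind of extensibility the PLANC abstract describes.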