8 research outputs found
Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
We explore the trade-offs of performing linear algebra using Apache Spark,
compared to traditional C and MPI implementations on HPC platforms. Spark is
designed for data analytics on cluster computing platforms with access to local
disks and is optimized for data-parallel tasks. We examine three widely-used
and important matrix factorizations: NMF (for physical plausability), PCA (for
its ubiquity) and CX (for data interpretability). We apply these methods to
TB-sized problems in particle physics, climate modeling and bioimaging. The
data matrices are tall-and-skinny which enable the algorithms to map
conveniently into Spark's data-parallel model. We perform scaling experiments
on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide
tuning guidance to obtain high performance
A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary
A nonnegative matrix factorization (NMF) can be computed efficiently under
the separability assumption, which asserts that all the columns of the given
input data matrix belong to the cone generated by a (small) subset of them. The
provably most robust methods to identify these conic basis columns are based on
nonnegative sparse regression and self dictionaries, and require the solution
of large-scale convex optimization problems. In this paper we study a
particular nonnegative sparse regression model with self dictionary. As opposed
to previously proposed models, this model yields a smooth optimization problem
where the sparsity is enforced through linear constraints. We show that the
Euclidean projection on the polyhedron defined by these constraints can be
computed efficiently, and propose a fast gradient method to solve our model. We
compare our algorithm with several state-of-the-art methods on synthetic data
sets and real-world hyperspectral images