Computing the R of the QR factorization of tall and skinny matrices using MPI_Reduce
A QR factorization of a tall and skinny matrix with n columns can be
represented as a reduction. The operation used along the reduction tree takes
as input two n-by-n upper triangular matrices and returns an n-by-n upper
triangular matrix, defined as the R factor of the two input matrices
stacked one on top of the other. This operation is binary, associative, and
commutative. We can therefore leverage the MPI library capabilities by using
user-defined MPI operations and MPI_Reduce to perform this reduction. The
resulting code is compact and portable. In this context, the user relies on the
MPI library to select a reduction tree appropriate for the underlying
architecture.
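The reduction operation described in this abstract can be sketched serially as follows. This is a minimal illustration, not the authors' MPI code: the names `r_factor`, `combine`, and `tsqr_r` are hypothetical, Givens rotations are an arbitrary choice of QR method, and `functools.reduce` stands in for the reduction tree that MPI_Reduce would select. The diagonal of R is forced to be nonnegative so that, for a full-rank matrix, R is unique and the operation is associative and commutative up to rounding.

```python
import math
from functools import reduce


def r_factor(a):
    """Return the n-by-n upper triangular R of an m-by-n matrix (m >= n).

    Uses Givens rotations on a copy of `a` (a list of rows) and flips row
    signs so the diagonal of R is nonnegative.
    """
    a = [row[:] for row in a]
    m, n = len(a), len(a[0])
    for j in range(n):
        for i in range(j + 1, m):
            if a[i][j] == 0.0:
                continue
            # Rotate rows j and i so that entry (i, j) becomes zero.
            h = math.hypot(a[j][j], a[i][j])
            c, s = a[j][j] / h, a[i][j] / h
            for k in range(j, n):
                top, bot = a[j][k], a[i][k]
                a[j][k] = c * top + s * bot
                a[i][k] = -s * top + c * bot
            a[i][j] = 0.0
    r = [a[i][:] for i in range(n)]
    for i in range(n):
        if r[i][i] < 0.0:
            r[i] = [-x for x in r[i]]
    return r


def combine(r1, r2):
    """Binary reduction operation: R factor of r1 stacked on top of r2."""
    return r_factor(r1 + r2)


def tsqr_r(blocks):
    """R factor of the row blocks stacked together, computed as a reduction."""
    return reduce(combine, [r_factor(b) for b in blocks])


# Assumed example data: a 9-by-2 matrix split into three 3-by-2 row blocks.
blocks = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [2.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [4.0, -1.0], [2.0, 2.0]],
]
r_tree = tsqr_r(blocks)
r_direct = r_factor([row for b in blocks for row in b])
```

Here `r_tree` and `r_direct` agree to rounding error. In an MPI setting, a function playing the role of `combine` would be registered as a user-defined operation (for example with `MPI_Op_create`) and passed to MPI_Reduce, with each process contributing the R factor of its local row block.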
Minimizing Communication for Eigenproblems and the Singular Value Decomposition
Algorithms have two costs: arithmetic and communication. The latter
represents the cost of moving data, either between levels of a memory
hierarchy, or between processors over a network. Communication often dominates
arithmetic and represents a rapidly increasing proportion of the total cost, so
we seek algorithms that minimize communication. In \cite{BDHS10} lower bounds
were presented on the amount of communication required for essentially all
O(n^3)-like algorithms for linear algebra, including eigenvalue problems and
the SVD. Conventional algorithms, including those currently implemented in
(Sca)LAPACK, perform asymptotically more communication than these lower bounds
require. In this paper we present parallel and sequential eigenvalue algorithms
(for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms
that do attain these lower bounds, and analyze their convergence and
communication costs.
Comment: 43 pages, 11 figures
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than in other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to articulate a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).
Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA).
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy of the task
dependency graph for direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that the usage of this scheduling in communication
avoiding dense factorization leads to significant performance gains. On a 48
core AMD Opteron NUMA machine, our experiments show that we can achieve up to
64% improvement over a version of CALU that uses fully dynamic scheduling, and
up to 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the versions of CALU that use
fully static or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in well
known libraries. On the 48 core AMD NUMA machine, our best implementation is up
to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to
82% faster than MKL. Our approach also shows significant speedups compared with
PLASMA on both of these systems.
Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
We explore the trade-offs of performing linear algebra using Apache Spark,
compared to traditional C and MPI implementations on HPC platforms. Spark is
designed for data analytics on cluster computing platforms with access to local
disks and is optimized for data-parallel tasks. We examine three widely-used
and important matrix factorizations: NMF (for physical plausibility), PCA (for
its ubiquity) and CX (for data interpretability). We apply these methods to
TB-sized problems in particle physics, climate modeling and bioimaging. The
data matrices are tall and skinny, which enables the algorithms to map
conveniently into Spark's data-parallel model. We perform scaling experiments
on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide
tuning guidance to obtain high performance.
A parallel butterfly algorithm
The butterfly algorithm is a fast algorithm which approximately evaluates a
discrete analogue of the integral transform \int K(x,y) g(y) dy at large
numbers of target points when the kernel, K(x,y), is approximately low-rank
when restricted to subdomains satisfying a certain simple geometric condition.
In d dimensions with O(N^d) quasi-uniformly distributed source and target
points, when each appropriate submatrix of K is approximately rank-r, the
running time of the algorithm is at most O(r^2 N^d log N). A parallelization of
the butterfly algorithm is introduced which, assuming a message latency of
\alpha and per-process inverse bandwidth of \beta, executes in at most O(r^2
N^d/p log N + (\beta r N^d/p + \alpha) log p) time using p processes. This
parallel algorithm was then instantiated in the form of the open-source
DistButterfly library for the special case where K(x,y)=exp(i \Phi(x,y)), where
\Phi(x,y) is a black-box, sufficiently smooth, real-valued phase function.
Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for
important classes of phase functions. Using quasi-uniform sources, hyperbolic
Radon transforms and an analogue of a 3D generalized Radon transform were
observed to strong-scale from 1-node/16-cores up to
1024-nodes/16,384-cores with greater than 90% and 82% efficiency, respectively.
Comment: To appear in SIAM Journal on Scientific Computing
Multilevel communication optimal LU and QR factorizations for hierarchical platforms
This study focuses on the performance of two classical dense linear algebra
algorithms, the LU and the QR factorizations, on multilevel hierarchical
platforms. We first introduce a new model called Hierarchical Cluster Platform
(HCP), encapsulating the characteristics of such platforms. The focus is set on
reducing the communication requirements of studied algorithms at each level of
the hierarchy. Lower bounds on communications are therefore extended with
respect to the HCP model. We then introduce multilevel LU and QR algorithms
tailored for those platforms, and provide a detailed performance analysis. We
also provide a set of numerical experiments and performance predictions
demonstrating the need for such algorithms on large platforms.