76 research outputs found
Dense matrix computations on NUMA architectures with distance-aware work stealing
We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now ubiquitous non-uniform memory access (NUMA) high concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. Performance results on a large NUMA system outperform the state-of-the-art existing implementations up to a two fold speedup for the Cholesky factorization, as well as the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version.The authors would like to thank the National Institute for Computational Sciences for granting us access on the Nautilus system. The KAUST authors acknowledge support of the Extreme Computing Research Center. The BSC-affiliated authors thankfully acknowledges the support of the European Commission through the HiPEAC-3 Network of Excellence (FP7-ICT 287759), Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaboration, Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), Generalitat de Catalunya (2014-SGR-1051) and the grant SEV-2011-00067 of the Severo Ochoa Program.Peer ReviewedPostprint (published version
Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance
We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they be-
come rich in operations that can achieve near-peak performance on a modern processor. The key is a
novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular
vector matrix. This algorithm is then implemented with optimizations that (1) leverage vector instruction
units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of
memory operations. We demonstrate the merits of these new QR algorithms for computing the Hermitian
eigenvalue decomposition (EVD) and singular value decomposition (SVD) of dense matrices when all eigen-
vectors/singular vectors are computed. The approach yields vastly improved performance relative to the
traditional QR algorithms for these problems and is competitive with two commonly used alternatives—
Cuppen’s Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations—
while inheriting the more modest O(n) workspace requirements of the original QR algorithms. Since the
computations performed by the restructured algorithms remain essentially identical to those performed by
the original methods, robust numerical properties are preserved
Solving Dense Generalized Eigenproblems on Multi-threaded Architectures
We compare two approaches to compute a fraction of the spectrum of dense symmetric definite generalized eigenproblems: one is based on the reduction to tridiagonal form, and the other on the Krylov-subspace iteration. Two large-scale applications, arising in molecular dynamics and material science, are employed to investigate the contributions of the application, architecture, and parallelism of the method to the performance of the solvers. The experimental results on a state-of-the-art 8-core platform, equipped with a graphics processing unit (GPU), reveal that in realistic applications, iterative Krylov-subspace methods can be a competitive approach also for the solution of dense problems
Householder QR Factorization With Randomization for Column Pivoting (HQRRP)
A fundamental problem when adding column pivoting to the Householder QR fac-
torization is that only about half of the computation can be cast in terms of high performing matrix-
matrix multiplications, which greatly limits the bene ts that can be derived from so-called blocking
of algorithms. This paper describes a technique for selecting groups of pivot vectors by means of
randomized projections. It is demonstrated that the asymptotic
op count for the proposed method
is 2mn2 �����(2=3)n3 for an m n matrix, identical to that of the best classical unblocked Householder
QR factorization algorithm (with or without pivoting). Experiments demonstrate acceleration in
speed of close to an order of magnitude relative to the geqp3 function in LAPACK, when executed
on a modern CPU with multiple cores. Further, experiments demonstrate that the quality of the
randomized pivot selection strategy is roughly the same as that of classical column pivoting. The
described algorithm is made available under Open Source license and can be used with LAPACK or
libflame
Solución de Problemas Matriciales de “Gran Escala” sobre Procesadores Multinúcleo y GPUs
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance
when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given
the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where
data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming
high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary
eight core architecture or a platform equiped with a graphics processor (GPU) one can solve a 100, 000 × 100, 000
symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not
necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution
to be computed on a fast multithreaded architecture like a multi-core computer or a GPU. This paper provides evidence in
support of these claimsPocos son conscientes de que, para matrices grandes, muchos cálculos matriciales obtienen casi el mismo rendimiento
cuando las matrices se encuentran almacenadas en disco que cuando residen en una memoria principal muy grande. De
manera parecida, pocos son conscientes de que, si se usan las abstracciones de programacón correctas, codificar algoritmos
Out-of-Core (OOC) para operaciones de Álgebra matricial densa (donde los datos residen en disco y tienen que moverse
explícitamente entre memoria principal y disco) no resulta más difícil que codificar algoritmos de altas prestaciones para
matrices que residen en memoria principal. Finalmente, pocos son conscientes de que en una arquictura actual con 8 núcleos
o un equipo con un procesador gráfico (GPU) es posible resolver un sistema lineal simétrico positivo definido de dimensión
100,000 × 100,000 aproximadamente en una hora. Así, para problemas que solían considerarse grandes, no es necesario
usar arquitecturas de memoria distribuida con grandes memorias si uno está dispuesto a esperar un cierto tiempo para que
la solución se obtenga en una arquitectura multihebra como un procesador multinúcleo o una GPU. Este trabajo presenta
evidencias que soportan tales afirmaciones
- …