Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
Sympiler is a domain-specific code generator that optimizes sparse matrix
computations by decoupling the symbolic analysis phase from the numerical
manipulation stage in sparse codes. The computation patterns in sparse
numerical methods are guided by the input sparsity structure and the sparse
algorithm itself. In many real-world simulations, the sparsity pattern changes
little or not at all. Sympiler takes advantage of these properties to
symbolically analyze sparse codes at compile time and to apply inspector-guided
transformations that in turn enable low-level transformations of the resulting
sparse code. As a result, the Sympiler-generated code outperforms highly
optimized matrix factorization codes from commonly used specialized libraries,
obtaining average speedups of 3.8× over Eigen and 1.5× over CHOLMOD.
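To make the symbolic/numeric split concrete, here is a minimal Python/SciPy sketch of the general technique for a sparse triangular solve Lx = b, one of the kernels such tools target. It is an illustration, not Sympiler's generated code: the symbolic phase computes the reach set from the sparsity pattern alone, so when the pattern is fixed it runs once and amortizes over every numeric solve. The function names and the small example matrix are illustrative assumptions.

import numpy as np
from scipy.sparse import csc_matrix

def symbolic_reach(L, b_nonzeros):
    """Symbolic phase: depth-first search over the pattern of L to find
    every x[j] that can become nonzero when solving L x = b. Uses only
    the sparsity structure of L, never its numeric values."""
    visited = [False] * L.shape[0]
    order = []
    def dfs(j):
        visited[j] = True
        for i in L.indices[L.indptr[j]:L.indptr[j + 1]]:
            if i > j and not visited[i]:   # entries below the diagonal
                dfs(i)
        order.append(j)
    for j in b_nonzeros:
        if not visited[j]:
            dfs(j)
    order.reverse()                        # topological order for the solve
    return order

def numeric_solve(L, b, order):
    """Numeric phase: the floating-point work, restricted to the
    precomputed reach set (assumes sorted indices, diagonal stored
    first in each column)."""
    x = b.astype(float).copy()
    for j in order:
        lo, hi = L.indptr[j], L.indptr[j + 1]
        x[j] /= L.data[lo]                          # divide by L[j, j]
        for i, lij in zip(L.indices[lo + 1:hi], L.data[lo + 1:hi]):
            x[i] -= lij * x[j]                      # scatter update below
    return x

# Lower-triangular matrix with a fixed sparsity pattern.
L = csc_matrix(np.array([[2., 0, 0, 0],
                         [1., 3, 0, 0],
                         [0., 0, 4, 0],
                         [0., 1, 0, 5]]))
b = np.array([2., 0, 0, 0])
order = symbolic_reach(L, b_nonzeros=[0])   # run once per sparsity pattern
x = numeric_solve(L, b, order)              # run once per right-hand side
assert np.allclose(L @ x, b)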
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents algorithms for the
Cholesky, LU, and QR factorizations in which the operations can be represented as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in out-of-order
execution of the tasks, which completely hides the presence of intrinsically
sequential tasks in the factorization. Performance comparisons are presented
against the LAPACK algorithms, in which parallelism can be exploited only at the
level of the BLAS operations, and against vendor implementations.
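As a rough illustration of the tile-task decomposition this abstract describes (written in Python/NumPy/SciPy rather than the paper's optimized BLAS kernels), the sketch below expresses a right-looking tiled Cholesky factorization as four kinds of tasks on square tiles: POTRF, TRSM, SYRK, and GEMM. The sequential loop nest makes the data dependencies explicit; a dynamic runtime would dispatch each task as soon as the tiles it reads have been written, allowing out-of-order execution. The helper name and tile size are illustrative assumptions.

import numpy as np
from scipy.linalg import solve_triangular

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky on an SPD matrix whose order is a
    multiple of the tile size nb. Each line tagged POTRF/TRSM/SYRK/GEMM
    is one task on an nb x nb tile; a tile may be read only after the
    task that last wrote it, which is the dependency a scheduler tracks."""
    A = A.copy()
    t = A.shape[0] // nb
    tile = lambda i, j: (slice(i * nb, (i + 1) * nb), slice(j * nb, (j + 1) * nb))
    for k in range(t):
        A[tile(k, k)] = np.linalg.cholesky(A[tile(k, k)])          # POTRF
        for i in range(k + 1, t):
            # TRSM: A_ik <- A_ik * L_kk^{-T}
            A[tile(i, k)] = solve_triangular(A[tile(k, k)], A[tile(i, k)].T, lower=True).T
        for i in range(k + 1, t):
            A[tile(i, i)] -= A[tile(i, k)] @ A[tile(i, k)].T       # SYRK
            for j in range(k + 1, i):
                A[tile(i, j)] -= A[tile(i, k)] @ A[tile(j, k)].T   # GEMM
    return np.tril(A)

# Check against the unblocked factorization on a small SPD matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))
A = G @ G.T + 8 * np.eye(8)
assert np.allclose(tiled_cholesky(A, nb=2), np.linalg.cholesky(A))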
Solución de Problemas Matriciales de “Gran Escala” sobre Procesadores Multinúcleo y GPUs (Solving “Large-Scale” Matrix Problems on Multicore Processors and GPUs)
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance
when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given
the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where
data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming
high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary
eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 × 100,000
symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not
necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution
to be computed on a fast multithreaded architecture like a multi-core computer or a GPU. This paper provides evidence in
support of these claims.
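As a hedged illustration of the OOC claim, the sketch below stores the matrix on disk with np.memmap and then runs the same blocked Cholesky loop one would write for an in-memory matrix; the operating system pages tiles in and out of RAM, standing in for the explicit I/O scheduling a real OOC library performs. The file name, sizes, and diagonally dominant test matrix are assumptions for the demonstration, not the paper's setup.

import numpy as np
from scipy.linalg import solve_triangular

n, nb = 2048, 256      # illustrative; scale n up for a genuinely out-of-core run
t = n // nb

# Build a diagonally dominant (hence SPD) matrix on disk, one tile at a
# time, so peak memory stays O(nb^2) no matter how large n is.
A = np.memmap("spd.bin", dtype=np.float64, mode="w+", shape=(n, n))
rng = np.random.default_rng(0)
for i in range(t):
    for j in range(i + 1):
        ii, jj = slice(i * nb, (i + 1) * nb), slice(j * nb, (j + 1) * nb)
        block = 0.01 * rng.standard_normal((nb, nb))
        if i == j:
            block = block @ block.T + n * np.eye(nb)   # symmetric, dominant diagonal
        A[ii, jj] = block
        A[jj, ii] = block.T

# The blocked factorization is the same loop nest as the in-memory code;
# only the storage behind A has changed. On exit, the lower triangle of
# the file holds the Cholesky factor L.
for k in range(t):
    kk = slice(k * nb, (k + 1) * nb)
    A[kk, kk] = np.linalg.cholesky(A[kk, kk])
    for i in range(k + 1, t):
        ii = slice(i * nb, (i + 1) * nb)
        A[ii, kk] = solve_triangular(A[kk, kk], A[ii, kk].T, lower=True).T
    for i in range(k + 1, t):
        ii = slice(i * nb, (i + 1) * nb)
        A[ii, ii] -= A[ii, kk] @ A[ii, kk].T
        for j in range(k + 1, i):
            jj = slice(j * nb, (j + 1) * nb)
            A[ii, jj] -= A[ii, kk] @ A[jj, kk].T
A.flush()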