Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
Sympiler is a domain-specific code generator that optimizes sparse matrix
computations by decoupling the symbolic analysis phase from the numerical
manipulation stage in sparse codes. The computation patterns in sparse
numerical methods are guided by the input sparsity structure and the sparse
algorithm itself. In many real-world simulations, the sparsity pattern changes
little or not at all. Sympiler takes advantage of these properties to
symbolically analyze sparse codes at compile time and to apply inspector-guided
transformations that in turn enable low-level transformations of the resulting
sparse code. As a result, the Sympiler-generated code outperforms highly
optimized matrix factorization codes from commonly used specialized libraries,
obtaining average speedups of 3.8× over Eigen and 1.5× over CHOLMOD.
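To make the symbolic/numeric split concrete, here is a minimal Python/SciPy sketch of the general technique for a sparse triangular solve Lx = b, one of the kernels such tools target. It is an illustration, not Sympiler's generated code: the symbolic phase computes the reach set from the sparsity pattern alone, so when the pattern is fixed it runs once and amortizes over every numeric solve. The function names and the small example matrix are illustrative assumptions.

import numpy as np
from scipy.sparse import csc_matrix

def symbolic_reach(L, b_nonzeros):
    """Symbolic phase: depth-first search over the pattern of L to find
    every x[j] that can become nonzero when solving L x = b. Uses only
    the sparsity structure of L, never its numeric values."""
    visited = [False] * L.shape[0]
    order = []
    def dfs(j):
        visited[j] = True
        for i in L.indices[L.indptr[j]:L.indptr[j + 1]]:
            if i > j and not visited[i]:   # entries below the diagonal
                dfs(i)
        order.append(j)
    for j in b_nonzeros:
        if not visited[j]:
            dfs(j)
    order.reverse()                        # topological order for the solve
    return order

def numeric_solve(L, b, order):
    """Numeric phase: the floating-point work, restricted to the
    precomputed reach set (assumes sorted indices, diagonal stored
    first in each column)."""
    x = b.astype(float).copy()
    for j in order:
        lo, hi = L.indptr[j], L.indptr[j + 1]
        x[j] /= L.data[lo]                          # divide by L[j, j]
        for i, lij in zip(L.indices[lo + 1:hi], L.data[lo + 1:hi]):
            x[i] -= lij * x[j]                      # scatter update below
    return x

# Lower-triangular matrix with a fixed sparsity pattern.
L = csc_matrix(np.array([[2., 0, 0, 0],
                         [1., 3, 0, 0],
                         [0., 0, 4, 0],
                         [0., 1, 0, 5]]))
b = np.array([2., 0, 0, 0])
order = symbolic_reach(L, b_nonzeros=[0])   # run once per sparsity pattern
x = numeric_solve(L, b, order)              # run once per right-hand side
assert np.allclose(L @ x, b)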
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents algorithms for the
Cholesky, LU, and QR factorizations in which the operations can be represented as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in out-of-order
execution of the tasks, which completely hides the presence of intrinsically
sequential tasks in the factorization. Performance comparisons are presented
against the LAPACK algorithms, in which parallelism can be exploited only at the
level of the BLAS operations, and against vendor implementations.
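As a rough illustration of the tile-task decomposition this abstract describes (written in Python/NumPy/SciPy rather than the paper's optimized BLAS kernels), the sketch below expresses a right-looking tiled Cholesky factorization as four kinds of tasks on square tiles: POTRF, TRSM, SYRK, and GEMM. The sequential loop nest makes the data dependencies explicit; a dynamic runtime would dispatch each task as soon as the tiles it reads have been written, allowing out-of-order execution. The helper name and tile size are illustrative assumptions.

import numpy as np
from scipy.linalg import solve_triangular

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky on an SPD matrix whose order is a
    multiple of the tile size nb. Each line tagged POTRF/TRSM/SYRK/GEMM
    is one task on an nb x nb tile; a tile may be read only after the
    task that last wrote it, which is the dependency a scheduler tracks."""
    A = A.copy()
    t = A.shape[0] // nb
    tile = lambda i, j: (slice(i * nb, (i + 1) * nb), slice(j * nb, (j + 1) * nb))
    for k in range(t):
        A[tile(k, k)] = np.linalg.cholesky(A[tile(k, k)])          # POTRF
        for i in range(k + 1, t):
            # TRSM: A_ik <- A_ik * L_kk^{-T}
            A[tile(i, k)] = solve_triangular(A[tile(k, k)], A[tile(i, k)].T, lower=True).T
        for i in range(k + 1, t):
            A[tile(i, i)] -= A[tile(i, k)] @ A[tile(i, k)].T       # SYRK
            for j in range(k + 1, i):
                A[tile(i, j)] -= A[tile(i, k)] @ A[tile(j, k)].T   # GEMM
    return np.tril(A)

# Check against the unblocked factorization on a small SPD matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))
A = G @ G.T + 8 * np.eye(8)
assert np.allclose(tiled_cholesky(A, nb=2), np.linalg.cholesky(A))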
Solución de Problemas Matriciales de “Gran Escala” sobre Procesadores Multinúcleo y GPUs (Solving “Large-Scale” Matrix Problems on Multicore Processors and GPUs)
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance
when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given
the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where
data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming
high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary
eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 × 100,000
symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not
necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution
to be computed on a fast multithreaded architecture like a multi-core computer or a GPU. This paper provides evidence in
support of these claims.
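As a hedged illustration of the OOC claim, the sketch below stores the matrix on disk with np.memmap and then runs the same blocked Cholesky loop one would write for an in-memory matrix; the operating system pages tiles in and out of RAM, standing in for the explicit I/O scheduling a real OOC library performs. The file name, sizes, and diagonally dominant test matrix are assumptions for the demonstration, not the paper's setup.

import numpy as np
from scipy.linalg import solve_triangular

n, nb = 2048, 256      # illustrative; scale n up for a genuinely out-of-core run
t = n // nb

# Build a diagonally dominant (hence SPD) matrix on disk, one tile at a
# time, so peak memory stays O(nb^2) no matter how large n is.
A = np.memmap("spd.bin", dtype=np.float64, mode="w+", shape=(n, n))
rng = np.random.default_rng(0)
for i in range(t):
    for j in range(i + 1):
        ii, jj = slice(i * nb, (i + 1) * nb), slice(j * nb, (j + 1) * nb)
        block = 0.01 * rng.standard_normal((nb, nb))
        if i == j:
            block = block @ block.T + n * np.eye(nb)   # symmetric, dominant diagonal
        A[ii, jj] = block
        A[jj, ii] = block.T

# The blocked factorization is the same loop nest as the in-memory code;
# only the storage behind A has changed. On exit, the lower triangle of
# the file holds the Cholesky factor L.
for k in range(t):
    kk = slice(k * nb, (k + 1) * nb)
    A[kk, kk] = np.linalg.cholesky(A[kk, kk])
    for i in range(k + 1, t):
        ii = slice(i * nb, (i + 1) * nb)
        A[ii, kk] = solve_triangular(A[kk, kk], A[ii, kk].T, lower=True).T
    for i in range(k + 1, t):
        ii = slice(i * nb, (i + 1) * nb)
        A[ii, ii] -= A[ii, kk] @ A[ii, kk].T
        for j in range(k + 1, i):
            jj = slice(j * nb, (j + 1) * nb)
            A[ii, jj] -= A[ii, kk] @ A[jj, kk].T
A.flush()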