Tiled QR factorization algorithms
This work revisits existing algorithms for the QR factorization of
rectangular matrices composed of p-by-q tiles, where p >= q. Within this
framework, we study the critical paths and performance of algorithms such as
Sameh and Kuck, Modi and Clarke, Greedy, and those found within PLASMA.
Although neither Modi and Clarke nor Greedy is optimal, both are shown to be
asymptotically optimal for all matrices of size p = q^2 f(q), where f is any
function such that \lim_{q \to +\infty} f(q) = 0. This novel and important complexity
result applies to all matrices where p and q are proportional, p = \lambda q,
with \lambda >= 1, thereby encompassing many important situations in practice
(least squares). We provide an extensive set of experiments that show the
superiority of the new algorithms for tall matrices.
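As a rough illustration of how such elimination schemes are compared, the following toy simulation counts the time steps of a Greedy-style schedule under a unit-cost model in which every pairwise annihilation of one tile row against another takes one step. This model is an assumption made here for illustration only: the paper weights the underlying factorization and update kernels differently, so the exact critical-path values differ.

```python
def greedy_steps(p, q):
    """Steps of a Greedy-style tiled QR schedule, unit cost per annihilation."""
    next_col = [0] * p                       # next column row i must be zeroed in

    def done(i):                             # row i finished (or promoted to pivot)
        return next_col[i] >= min(i, q)

    steps = 0
    while not all(done(i) for i in range(p)):
        steps += 1
        # Higher columns first, so a row promoted this step is not used twice.
        for j in reversed(range(q)):
            ready = [i for i in range(p) if not done(i) and next_col[i] == j]
            pivot = 1 if next_col[j] == j else 0    # row j already holds column j
            # Each annihilation pairs two rows; the pivot can absorb one victim.
            for i in ready[:(len(ready) + pivot) // 2]:
                next_col[i] += 1             # row i annihilated in column j
    return steps

print(greedy_steps(16, 1))   # single column: binomial tree, ceil(log2(16)) = 4 steps
print(greedy_steps(64, 8))   # tall tiled matrix, p = q^2
```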
The Reverse Cuthill-McKee Algorithm in Distributed-Memory
Ordering the vertices of a graph is key to minimizing fill-in and data-structure
size in sparse direct solvers, maximizing locality in iterative solvers, and
improving performance in graph algorithms. Except for naturally parallelizable
ordering methods such as nested dissection, many important ordering methods
have not been efficiently mapped to distributed-memory architectures. In this
paper, we present the first-ever distributed-memory implementation of the
reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse
matrix. Our parallelization uses a two-dimensional sparse matrix decomposition.
We achieve high performance by decomposing the problem into a small number of
primitives and utilizing optimized implementations of these primitives. Our
implementation shows strong scaling up to 1024 cores for smaller matrices and
up to 4096 cores for larger matrices.
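For readers who want to see the effect of the ordering itself, SciPy ships a sequential, shared-memory RCM implementation; the sketch below uses it to show the bandwidth reduction on a random symmetric matrix. This is only a single-node illustration of what RCM does, not the distributed-memory algorithm the paper contributes.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(m):
    """Max distance of a nonzero from the diagonal (a proxy for the profile)."""
    coo = m.tocoo()
    return int(np.abs(coo.row - coo.col).max())

a = sp.random(500, 500, density=0.01, random_state=0)
a = (a + a.T).tocsr()                       # symmetrize the pattern

perm = reverse_cuthill_mckee(a, symmetric_mode=True)
a_rcm = a[perm][:, perm]                    # apply the symmetric permutation

print("bandwidth before:", bandwidth(a), "after:", bandwidth(a_rcm))
```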
A 3D Parallel Algorithm for QR Decomposition
Interprocessor communication often dominates the runtime of large matrix
computations. We present a parallel algorithm for computing QR decompositions
whose bandwidth cost (communication volume) can be decreased at the cost of
increasing its latency cost (number of messages). By varying a parameter to
navigate the bandwidth/latency tradeoff, we can tune this algorithm for
machines with different communication costs.
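The abstract does not spell out the algorithm, but the bandwidth/latency tension is easy to see in the simpler tree-based TSQR kernel that communication-avoiding QR methods build on. The serial NumPy sketch below (names mine) reduces local R factors up a binary tree; a wider tree would use fewer reduction rounds (fewer messages) at the cost of larger factors per round, a simple analogue of the tunable tradeoff, not the paper's 3D algorithm itself.

```python
import numpy as np

def tsqr_r(blocks):
    """R factor of the stacked blocks via a binary reduction tree of small QRs."""
    rs = [np.linalg.qr(b, mode="r") for b in blocks]      # local factorizations
    while len(rs) > 1:                                    # pairwise tree reduction
        pairs = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode="r")
                 for i in range(0, len(rs) - 1, 2)]
        rs = pairs + ([rs[-1]] if len(rs) % 2 else [])
    return rs[0]

a = np.random.default_rng(0).standard_normal((4096, 32))
r_tree = tsqr_r(np.array_split(a, 8))
r_ref = np.linalg.qr(a, mode="r")
print(np.allclose(np.abs(r_tree), np.abs(r_ref)))         # R agrees up to row signs
```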
The impact of cache misses on the performance of matrix product algorithms on multicore platforms
The multicore revolution is underway, bringing new chips with increasingly complex memory architectures. Classical algorithms must be revisited to take the hierarchical memory layout into account. In this paper, we aim at designing cache-aware algorithms that minimize the number of cache misses paid during the execution of the matrix product kernel on a multicore processor. We analytically show how to achieve the best possible tradeoff between shared and distributed caches. We implement and evaluate several algorithms on two multicore platforms, one equipped with a quad-core Xeon and the other with a GPU. It turns out that the impact of cache misses is very different across the two platforms, and we identify the main design parameters that lead to peak performance for each target hardware configuration.
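A minimal sketch of the cache-blocking idea the paper analyzes: compute the product tile by tile, so that the three blocks touched by each inner update fit in cache together. The block size blk is the knob such an analysis tunes against the shared and private cache sizes; the value below is an arbitrary placeholder, not a tuned constant from the paper.

```python
import numpy as np

def blocked_matmul(a, b, blk=64):
    """C = A @ B computed block by block to improve cache reuse."""
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(0, n, blk):
        for j in range(0, n, blk):
            for k in range(0, n, blk):
                # One blk x blk tile of C, updated with a small block product;
                # the three operand tiles are what must fit in cache together.
                c[i:i+blk, j:j+blk] += a[i:i+blk, k:k+blk] @ b[k:k+blk, j:j+blk]
    return c

a = np.random.rand(256, 256)
b = np.random.rand(256, 256)
print(np.allclose(blocked_matmul(a, b), a @ b))
```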
Multilevel communication optimal LU and QR factorizations for hierarchical platforms
This study focuses on the performance of two classical dense linear algebra
algorithms, the LU and the QR factorizations, on multilevel hierarchical
platforms. We first introduce a new model called Hierarchical Cluster Platform
(HCP), which encapsulates the characteristics of such platforms. The focus is on
reducing the communication requirements of the studied algorithms at each level of
the hierarchy. Communication lower bounds are extended accordingly to the HCP
model. We then introduce multilevel LU and QR algorithms tailored for these
platforms, and provide a detailed performance analysis. We also provide a set of
numerical experiments and performance predictions demonstrating the need for such
algorithms on large platforms.
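For context, the single-level bandwidth and latency lower bounds that such multilevel analyses generalize have, for O(n^3) dense factorizations like LU and QR, the following classical shape (a standard result quoted from the literature, not a formula taken from this abstract; the HCP extension instantiates bounds of this shape with each level's memory size):

```latex
% Per-processor communication lower bounds for an O(n^3) factorization
% run on P processors, each with a local memory of size M:
\[
  W \;=\; \Omega\!\left(\frac{n^{3}}{P\sqrt{M}}\right) \text{ words moved},
  \qquad
  S \;=\; \Omega\!\left(\frac{n^{3}}{P\,M^{3/2}}\right) \text{ messages}.
\]
```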
PSelInv: A Distributed Memory Parallel Algorithm for Selected Inversion, the Non-Symmetric Case
This paper generalizes the parallel selected inversion algorithm called
PSelInv to sparse non-symmetric matrices. We assume a general sparse matrix A
has been decomposed as PAQ = LU on a distributed memory parallel machine, where
L, U are lower and upper triangular matrices, and P, Q are permutation
matrices, respectively. The PSelInv method computes selected elements of A^{-1}.
The selection is confined by the sparsity pattern of the matrix A^T. Our
algorithm does not assume any symmetry properties of A, and our parallel
implementation is memory efficient, in the sense that the computed elements of
A^{-T} overwrite the sparse matrix L+U in situ. PSelInv involves a large number
of collective data communication activities within different processor groups
of various sizes. In order to minimize idle time and improve load balancing,
tree-based asynchronous communication is used to coordinate all such collective
communication. Numerical results demonstrate that PSelInv can scale efficiently
to 6,400 cores for a variety of matrices.
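As a brute-force correctness oracle for what "selected elements" means here, the sketch below solves with sparse LU factors column by column and keeps only the entries of inv(A) lying on the sparsity pattern of A^T. PSelInv computes the same selected entries without ever forming whole inverse columns; this naive version (names and structure mine) is only usable on small matrices.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def selected_inverse(a):
    """Entries of inv(A) restricted to the sparsity pattern of A^T (naive)."""
    lu = splu(a.tocsc())                    # sparse LU with row/column permutations
    at = a.T.tocsc()
    n = a.shape[0]
    out = sp.lil_matrix((n, n))
    e = np.zeros(n)
    for j in range(n):
        e[j] = 1.0
        col = lu.solve(e)                   # full j-th column of inv(A)
        e[j] = 0.0
        for i in at[:, j].nonzero()[0]:     # keep only the selected entries
            out[i, j] = col[i]
    return out.tocsr()

# shifted diagonal keeps the random test matrix comfortably nonsingular
a = sp.random(200, 200, density=0.02, format="csc") + 10 * sp.eye(200, format="csc")
sel = selected_inverse(a)
print(sel.nnz, "selected entries computed")
```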
On optimal tree traversals for sparse matrix factorization
We study the complexity of traversing tree-shaped workflows whose tasks require large I/O files. Such workflows typically arise in the multifrontal method of sparse matrix factorization. We target a classical two-level memory system, where the main memory is faster but smaller than the secondary memory. A task in the workflow can be processed if all its predecessors have been processed, and if its input and output files fit in the currently available main memory. The amount of available memory at a given time depends upon the ordering in which the tasks are executed. What is the minimum amount of main memory, over all postorder schemes, or over all possible traversals, that is needed for an in-core execution? We establish several complexity results that answer these questions. We propose a new, polynomial-time, exact algorithm which runs faster than a reference algorithm. Next, we address the setting where the required memory renders a pure in-core solution infeasible. In this setting, we ask the following question: what is the minimum amount of I/O that must be performed between the main memory and the secondary memory? We show that this latter problem is NP-hard, and propose efficient heuristics. All algorithms and heuristics are thoroughly evaluated on assembly trees arising in the context of sparse matrix factorizations.
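The postorder question has a classical answer (Liu's algorithm) that such studies build on: if node i produces a file of size f[i] and processing a node needs all of its children's files plus its own output resident in memory, the best postorder processes children in decreasing order of (subtree peak minus output size). The sketch below implements that rule under this simplified memory model; the model is my assumption and may differ in detail from the paper's.

```python
def postorder_peak(children, f, root):
    """Minimum peak memory over postorder traversals, simplified file model."""
    def peak(v):
        kids = children.get(v, [])
        if not kids:
            return f[v]
        # Liu's rule: visit children in decreasing order of (peak - output size).
        sub = sorted(((peak(c), f[c]) for c in kids),
                     key=lambda t: t[0] - t[1], reverse=True)
        held = 0                             # children files already produced
        best = 0
        for p, fc in sub:
            best = max(best, held + p)       # while traversing this child subtree
            held += fc                       # child's output stays resident
        return max(best, held + f[v])        # assembling the parent's own file
    return peak(root)

# toy assembly tree: root 0 has children 1 and 2; node 1 has children 3 and 4
children = {0: [1, 2], 1: [3, 4]}
f = {0: 4, 1: 3, 2: 5, 3: 2, 4: 2}
print(postorder_peak(children, f, 0))
```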