Communication-optimal Parallel and Sequential Cholesky Decomposition
Numerical algorithms have two kinds of costs: arithmetic and communication,
by which we mean either moving data between levels of a memory hierarchy (in
the sequential case) or over a network connecting processors (in the parallel
case). Communication costs often dominate arithmetic costs, so it is of
interest to design algorithms minimizing communication. In this paper we first
extend known lower bounds on the communication cost (both for bandwidth and for
latency) of conventional (O(n^3)) matrix multiplication to Cholesky
factorization, which is used for solving dense symmetric positive definite
linear systems. Second, we compare the costs of various Cholesky decomposition
implementations to these lower bounds and identify the algorithms and data
structures that attain them. In the sequential case, we consider both the
two-level and hierarchical memory models. Combined with prior results in [13,
14, 15], this gives a set of communication-optimal algorithms for O(n^3)
implementations of the three basic factorizations of dense linear algebra: LU
with pivoting, QR and Cholesky. But it goes beyond this prior work on
sequential LU by optimizing communication for any number of levels of memory
hierarchy. Comment: 29 pages, 2 tables, 6 figures
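The two-level memory model behind these communication-optimal algorithms can be illustrated with a minimal blocked Cholesky sketch. This is not the paper's implementation; it is a hypothetical left-looking variant in which the block size `bs` stands in for the fast-memory capacity (on the order of √M), so each kernel call works on data that fits in fast memory:

```python
import numpy as np

def blocked_cholesky(A, bs):
    """Left-looking blocked Cholesky: factor A = L L^T using bs-by-bs blocks.

    Working on square blocks that fit in fast memory is the idea behind
    the communication-optimal variants discussed above. A must be
    symmetric positive definite.
    """
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(0, n, bs):
        je = min(j + bs, n)
        # Update the diagonal block with previously computed panels.
        S = A[j:je, j:je] - L[j:je, :j] @ L[j:je, :j].T
        L[j:je, j:je] = np.linalg.cholesky(S)
        if je < n:
            # Update the panel below the diagonal block, then do the
            # triangular solve L[i,j] = P * L[j,j]^{-T}.
            P = A[je:, j:je] - L[je:, :j] @ L[j:je, :j].T
            L[je:, j:je] = np.linalg.solve(L[j:je, j:je], P.T).T
    return L
```

The communication question the paper studies is how many words such an algorithm must move between slow and fast memory as a function of n and the fast-memory size.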
Solving large scale linear programming problems
The interior point method (IPM) is now well established as a computationally competitive scheme for solving very large scale linear programming problems. The leading variant of the IPM is the primal-dual predictor-corrector algorithm due to Mehrotra. The main computational effort in this algorithm is the repeated calculation and solution of a large sparse positive definite system of equations.
We describe an implementation of this algorithm for vector processors. At the heart of the implementation is a vectorized matrix multiplication and Cholesky factorization for sparse matrices.
We identify the parts where vectorization can be beneficial and discuss in detail the merits of alternative vectorization techniques. We show that the best way to utilize a vector processor is by exploiting dense computation within the sparse framework and by unrolling loop operations. We further present an extended definition of supernodes, and describe an implementation based on this new approach. We show that although this approach requires more memory, it can increase the scope of dense computation substantially without adding extra operations.
Performance results on standard industrial test problems, and a comparison between an algorithm that utilizes the extended supernodes and one that utilizes standard supernodes, are presented and discussed.
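The dominant computation described above can be sketched in a few lines. `normal_equations_step` is a hypothetical name, and the dense Cholesky factorization here stands in for the sparse, supernodal factorization the paper actually vectorizes:

```python
import numpy as np

def normal_equations_step(A, d, r):
    """Form and solve the positive definite system (A D A^T) dy = r that
    dominates each iteration of a primal-dual interior point method,
    where D = diag(d) with d > 0.

    A dense sketch for illustration; production IPM codes factor a
    sparse matrix and exploit supernodes for dense inner kernels.
    """
    M = (A * d) @ A.T          # A @ diag(d) @ A.T without forming diag(d)
    L = np.linalg.cholesky(M)  # M = L L^T
    # Two triangular solves: L z = r, then L^T dy = z.
    z = np.linalg.solve(L, r)
    return np.linalg.solve(L.T, z)
```

In an actual IPM, only the diagonal `d` changes between iterations, which is why the repeated factorization of a matrix with fixed sparsity pattern dominates the run time.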
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents an algorithm for the
Cholesky, LU and QR factorizations in which the operations can be represented as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in an
out-of-order execution of the tasks that completely hides the presence of
intrinsically sequential tasks in the factorization. Performance comparisons
are presented with the LAPACK algorithms, where parallelism can only be
exploited at the level of BLAS operations, and with vendor implementations.
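A serial sketch helps make the task structure concrete. Each commented step below (POTRF, TRSM, SYRK, GEMM) is one of the small block tasks that a runtime would schedule dynamically as its input tiles become ready; in this hypothetical illustration they simply run in program order:

```python
import numpy as np

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky over nb-by-nb tiles of a symmetric
    positive definite matrix A; returns the lower factor L with A = L L^T.
    """
    n = A.shape[0]
    C = A.astype(float).copy()
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # POTRF task: factor the diagonal tile in place.
        C[k:ke, k:ke] = np.linalg.cholesky(C[k:ke, k:ke])
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # TRSM task: triangular solve for a tile below the diagonal.
            C[i:ie, k:ke] = np.linalg.solve(C[k:ke, k:ke], C[i:ie, k:ke].T).T
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # SYRK task: symmetric update of a trailing diagonal tile.
            C[i:ie, i:ie] -= C[i:ie, k:ke] @ C[i:ie, k:ke].T
            for j in range(ke, i, nb):
                je = min(j + nb, n)
                # GEMM task: update of a trailing off-diagonal tile.
                C[i:ie, j:je] -= C[i:ie, k:ke] @ C[j:je, k:ke].T
    return np.tril(C)
```

The dependencies are visible in the indexing: each TRSM needs the POTRF of its panel, and each SYRK/GEMM needs the two TRSM outputs it multiplies, which is exactly the DAG a dynamic scheduler walks.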
Highly parallel sparse Cholesky factorization
Several fine-grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and used to analyze the algorithms.
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense matrix multiplication using the conventional
algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as Ω(#arithmetic operations / √M), where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, LDL^T factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain. Comment: 27 pages, 2 tables
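Written out, the bandwidth and latency lower bounds referenced above take the form

```latex
W \;=\; \Omega\!\left(\frac{\#\,\text{arithmetic operations}}{\sqrt{M}}\right),
\qquad
S \;=\; \Omega\!\left(\frac{\#\,\text{arithmetic operations}}{M^{3/2}}\right),
```

where W counts words moved (bandwidth), S counts messages (latency), and M is the size of the fast memory in the sequential case or of the local memory in the parallel case. For conventional O(n^3) matrix multiplication the operation count is Θ(n^3), recovering the classical Ω(n^3/√M) bandwidth bound of Hong and Kung.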
Solving large dense linear least squares problems on parallel distributed computers. Application to the Earth's gravity field computation.
In this thesis, we present our research in high performance scientific computing for linear least squares. More precisely, we are concerned with developing efficient parallel software that can solve very large dense linear least squares problems and with providing numerical tools that can assess the quality of the solution. This thesis is also a contribution to the GOCE mission, which strives for a very accurate model of the Earth's gravity field. This satellite is scheduled for launch in 2007 and, in this respect, our work represents a step in the definition of algorithms for the project. We present an overview of the numerical strategies that can be used for updating the solution with new observations coming from GOCE measurements. Then we describe a parallel distributed solver that we implemented for use in the CNES software package for orbit determination and gravity field computation. This solver compares well in terms of performance with the standard parallel libraries ScaLAPACK and PLAPACK on the operational platforms used in the space industry, while saving about half the memory by taking the symmetry of the problem into account. In order to improve the scalability and portability of our solver, we define a packed distributed format based on ScaLAPACK kernel routines. This approach is a significant improvement since no packed distributed format is available for symmetric or triangular matrices in the existing dense parallel libraries.
Examples are given for the Cholesky factorization and for the updating of a QR factorization. This format can easily be extended to other linear algebra computations. This thesis also contains new results in the area of sensitivity analysis for linear least squares resulting from parameter estimation problems. Specifically, we provide a closed formula, bounds of the correct order of magnitude, and statistical estimates that enable us to evaluate the condition number of linear functionals of the least squares solution. The choice between the different expressions will depend on the problem size and on the desired level of accuracy.
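The normal-equations route that underlies the solver's use of Cholesky, and the symmetry that halves its storage, can be sketched in a few lines. This is a hypothetical dense illustration, not the thesis's distributed packed-format code:

```python
import numpy as np

def ls_via_normal_equations(A, b):
    """Solve min ||A x - b||_2 via Cholesky factorization of the
    symmetric system A^T A x = A^T b.

    Assumes A has full column rank. Since A^T A is symmetric, only
    one triangle needs to be stored, which is the saving a packed
    format exploits. Note that forming A^T A squares the condition
    number, so this route suits well-conditioned problems.
    """
    C = A.T @ A                # symmetric positive definite
    L = np.linalg.cholesky(C)  # C = L L^T
    y = np.linalg.solve(L, A.T @ b)
    return np.linalg.solve(L.T, y)
```

New observation blocks can then be folded in by updating the factorization rather than refactoring from scratch, which is the updating problem the thesis addresses for QR.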
A Case Study in Coordination Programming: Performance Evaluation of S-Net vs Intel's Concurrent Collections
We present a programming methodology and runtime performance case study
comparing the declarative data flow coordination language S-Net with Intel's
Concurrent Collections (CnC). As a coordination language S-Net achieves a
near-complete separation of concerns between sequential software components
implemented in a separate algorithmic language and their parallel orchestration
in an asynchronous data flow streaming network. We investigate the merits of
S-Net and CnC with the help of a relevant and non-trivial linear algebra
problem: tiled Cholesky decomposition. We describe two alternative S-Net
implementations of tiled Cholesky factorization and compare them with two CnC
implementations, one with explicit performance tuning and one without, that
have previously been used to illustrate Intel CnC. Our experiments on a 48-core
machine demonstrate that S-Net manages to outperform CnC on this problem. Comment: 9 pages, 8 figures, 1 table, accepted for PLC 2014 workshop
Programming Models' Support for Heterogeneous Architecture
Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Heterogeneous systems equipped with accelerators such as GPUs have become the most prominent components of High Performance Computing (HPC) systems. Even at the node level, the significant heterogeneity of CPU and GPU, i.e. hardware and memory space differences, leads to challenges in fully exploiting such complex architectures. Extending beyond the node scope only escalates these challenges.
Conventional programming models such as data-flow and message passing have been widely adopted in HPC communities. When moving towards heterogeneous systems, the lack of GPU integration causes such programming models to struggle in handling the heterogeneity of different computing units, leading to sub-optimal performance and a drastic decrease in developer productivity. To bridge the gap between underlying heterogeneous architectures and current programming paradigms, we propose to extend these paradigms with architecture-aware optimizations.
Two programming models are used to demonstrate the impact of heterogeneous architecture awareness. The PaRSEC task-based runtime, an adopter of the data-flow model, provides opportunities for overlapping communication with computation and minimizing data movement, as well as for dynamically adapting the work granularity to the capability of the hardware.
To meet the demand for an efficient and portable Message Passing Interface (MPI) implementation that can communicate GPU data, a GPU-aware design is presented based on the Open MPI infrastructure. It supports efficient point-to-point and collective communication of GPU-resident data, for both contiguous and non-contiguous memory layouts, by leveraging GPU network topology and hardware capabilities such as GPUDirect. The tight integration of GPU support in a widely used programming environment frees developers from manually moving data into and out of host memory around MPI communication calls, allowing them to focus instead on algorithmic optimizations.
Experimental results confirm that, supported by such a tight and transparent integration, conventional programming models can once again take advantage of state-of-the-art hardware and exhibit performance at the levels expected from the underlying hardware capabilities.