Communication-optimal Parallel and Sequential Cholesky Decomposition
Numerical algorithms have two kinds of costs: arithmetic and communication,
by which we mean either moving data between levels of a memory hierarchy (in
the sequential case) or over a network connecting processors (in the parallel
case). Communication costs often dominate arithmetic costs, so it is of
interest to design algorithms minimizing communication. In this paper we first
extend known lower bounds on the communication cost (both for bandwidth and for
latency) of conventional (O(n^3)) matrix multiplication to Cholesky
factorization, which is used for solving dense symmetric positive definite
linear systems. Second, we compare the costs of various Cholesky decomposition
implementations to these lower bounds and identify the algorithms and data
structures that attain them. In the sequential case, we consider both the
two-level and hierarchical memory models. Combined with prior results in [13,
14, 15], this gives a set of communication-optimal algorithms for O(n^3)
implementations of the three basic factorizations of dense linear algebra: LU
with pivoting, QR and Cholesky. But it goes beyond this prior work on
sequential LU by optimizing communication for any number of levels of memory
hierarchy. Comment: 29 pages, 2 tables, 6 figures
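The two-level memory model behind these communication-optimal algorithms can be illustrated with a minimal blocked Cholesky sketch. This is not the paper's implementation; it is a hypothetical left-looking variant in which the block size `bs` stands in for the fast-memory capacity (on the order of √M), so each kernel call works on data that fits in fast memory:

```python
import numpy as np

def blocked_cholesky(A, bs):
    """Left-looking blocked Cholesky: factor A = L L^T using bs-by-bs blocks.

    Working on square blocks that fit in fast memory is the idea behind
    the communication-optimal variants discussed above. A must be
    symmetric positive definite.
    """
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(0, n, bs):
        je = min(j + bs, n)
        # Update the diagonal block with previously computed panels.
        S = A[j:je, j:je] - L[j:je, :j] @ L[j:je, :j].T
        L[j:je, j:je] = np.linalg.cholesky(S)
        if je < n:
            # Update the panel below the diagonal block, then do the
            # triangular solve L[i,j] = P * L[j,j]^{-T}.
            P = A[je:, j:je] - L[je:, :j] @ L[j:je, :j].T
            L[je:, j:je] = np.linalg.solve(L[j:je, j:je], P.T).T
    return L
```

The communication question the paper studies is how many words such an algorithm must move between slow and fast memory as a function of n and the fast-memory size.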
Solving large scale linear programming problems
The interior point method (IPM) is now well established as a computationally competitive scheme for solving very large scale linear programming problems. The leading variant of the IPM is the primal-dual predictor-corrector algorithm due to Mehrotra. The main computational effort in this algorithm is the repeated calculation and solution of a large sparse positive definite system of equations.
We describe an implementation of this algorithm for vector processors. At the heart of the implementation is a vectorized matrix multiplication and Cholesky factorization for sparse matrices.
We identify the parts where vectorization can be beneficial and discuss in detail the merits of alternative vectorization techniques. We show that the best way to utilize a vector processor is by exploiting dense computation within the sparse framework and by unrolling loop operations. We further present an extended definition of supernodes, and describe an implementation based on this new approach. We show that although this approach requires more memory, it can increase the scope of dense computation substantially without adding extra operations.
Performance results on standard industrial test problems, and a comparison between an algorithm that utilizes the extended supernodes and one that utilizes standard supernodes, are presented and discussed.
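The dominant computation described above can be sketched in a few lines. `normal_equations_step` is a hypothetical name, and the dense Cholesky factorization here stands in for the sparse, supernodal factorization the paper actually vectorizes:

```python
import numpy as np

def normal_equations_step(A, d, r):
    """Form and solve the positive definite system (A D A^T) dy = r that
    dominates each iteration of a primal-dual interior point method,
    where D = diag(d) with d > 0.

    A dense sketch for illustration; production IPM codes factor a
    sparse matrix and exploit supernodes for dense inner kernels.
    """
    M = (A * d) @ A.T          # A @ diag(d) @ A.T without forming diag(d)
    L = np.linalg.cholesky(M)  # M = L L^T
    # Two triangular solves: L z = r, then L^T dy = z.
    z = np.linalg.solve(L, r)
    return np.linalg.solve(L.T, z)
```

In an actual IPM, only the diagonal `d` changes between iterations, which is why the repeated factorization of a matrix with fixed sparsity pattern dominates the run time.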
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents an algorithm for the
Cholesky, LU and QR factorizations in which the operations can be represented as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in an
out-of-order execution of the tasks that completely hides the presence of
intrinsically sequential tasks in the factorization. Performance comparisons
are presented with the LAPACK algorithms, where parallelism can only be
exploited at the level of BLAS operations, and with vendor implementations.
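A serial sketch helps make the task structure concrete. Each commented step below (POTRF, TRSM, SYRK, GEMM) is one of the small block tasks that a runtime would schedule dynamically as its input tiles become ready; in this hypothetical illustration they simply run in program order:

```python
import numpy as np

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky over nb-by-nb tiles of a symmetric
    positive definite matrix A; returns the lower factor L with A = L L^T.
    """
    n = A.shape[0]
    C = A.astype(float).copy()
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # POTRF task: factor the diagonal tile in place.
        C[k:ke, k:ke] = np.linalg.cholesky(C[k:ke, k:ke])
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # TRSM task: triangular solve for a tile below the diagonal.
            C[i:ie, k:ke] = np.linalg.solve(C[k:ke, k:ke], C[i:ie, k:ke].T).T
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # SYRK task: symmetric update of a trailing diagonal tile.
            C[i:ie, i:ie] -= C[i:ie, k:ke] @ C[i:ie, k:ke].T
            for j in range(ke, i, nb):
                je = min(j + nb, n)
                # GEMM task: update of a trailing off-diagonal tile.
                C[i:ie, j:je] -= C[i:ie, k:ke] @ C[j:je, k:ke].T
    return np.tril(C)
```

The dependencies are visible in the indexing: each TRSM needs the POTRF of its panel, and each SYRK/GEMM needs the two TRSM outputs it multiplies, which is exactly the DAG a dynamic scheduler walks.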
Highly parallel sparse Cholesky factorization
Several fine-grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and used to analyze the algorithms.
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense matrix multiplication using the conventional
algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as Ω(#arithmetic operations / √M), where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, LDL^T factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain. Comment: 27 pages, 2 tables
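Written out, the bandwidth and latency lower bounds referenced above take the form

```latex
W \;=\; \Omega\!\left(\frac{\#\,\text{arithmetic operations}}{\sqrt{M}}\right),
\qquad
S \;=\; \Omega\!\left(\frac{\#\,\text{arithmetic operations}}{M^{3/2}}\right),
```

where W counts words moved (bandwidth), S counts messages (latency), and M is the size of the fast memory in the sequential case or of the local memory in the parallel case. For conventional O(n^3) matrix multiplication the operation count is Θ(n^3), recovering the classical Ω(n^3/√M) bandwidth bound of Hong and Kung.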
Solving large dense linear least squares problems on parallel distributed computers. Application to the Earth's gravity field computation.
In this thesis, we present our research in high performance scientific computing for linear least squares. More precisely, we are concerned with developing efficient parallel software that can solve very large dense linear least squares problems and with providing numerical tools that can assess the quality of the solution. This thesis is also a contribution to the GOCE mission, which strives for a very accurate model of the Earth's gravity field. This satellite is scheduled for launch in 2007 and, in this respect, our work represents a step in the definition of algorithms for the project. We present an overview of the numerical strategies that can be used for updating the solution with new observations coming from GOCE measurements. Then we describe a parallel distributed solver that we implemented for use in the CNES software package for orbit determination and gravity field computation. This solver compares well in terms of performance with the standard parallel libraries ScaLAPACK and PLAPACK on the operational platforms used in the space industry, while saving about half the memory by taking the symmetry of the problem into account. In order to improve the scalability and portability of our solver, we define a packed distributed format based on ScaLAPACK kernel routines. This approach is a significant improvement since no packed distributed format is available for symmetric or triangular matrices in the existing dense parallel libraries.
Examples are given for the Cholesky factorization and for the updating of a QR factorization. This format can easily be extended to other linear algebra computations. This thesis also contains new results in the area of sensitivity analysis for linear least squares resulting from parameter estimation problems. Specifically, we provide a closed formula, bounds of the correct order of magnitude, and statistical estimates that enable us to evaluate the condition number of linear functionals of the least squares solution. The choice between the different expressions will depend on the problem size and on the desired level of accuracy.
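The normal-equations route that underlies the solver's use of Cholesky, and the symmetry that halves its storage, can be sketched in a few lines. This is a hypothetical dense illustration, not the thesis's distributed packed-format code:

```python
import numpy as np

def ls_via_normal_equations(A, b):
    """Solve min ||A x - b||_2 via Cholesky factorization of the
    symmetric system A^T A x = A^T b.

    Assumes A has full column rank. Since A^T A is symmetric, only
    one triangle needs to be stored, which is the saving a packed
    format exploits. Note that forming A^T A squares the condition
    number, so this route suits well-conditioned problems.
    """
    C = A.T @ A                # symmetric positive definite
    L = np.linalg.cholesky(C)  # C = L L^T
    y = np.linalg.solve(L, A.T @ b)
    return np.linalg.solve(L.T, y)
```

New observation blocks can then be folded in by updating the factorization rather than refactoring from scratch, which is the updating problem the thesis addresses for QR.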
A Case Study in Coordination Programming: Performance Evaluation of S-Net vs Intel's Concurrent Collections
We present a programming methodology and runtime performance case study
comparing the declarative data flow coordination language S-Net with Intel's
Concurrent Collections (CnC). As a coordination language S-Net achieves a
near-complete separation of concerns between sequential software components
implemented in a separate algorithmic language and their parallel orchestration
in an asynchronous data flow streaming network. We investigate the merits of
S-Net and CnC with the help of a relevant and non-trivial linear algebra
problem: tiled Cholesky decomposition. We describe two alternative S-Net
implementations of tiled Cholesky factorization and compare them with two CnC
implementations, one with explicit performance tuning and one without, that
have previously been used to illustrate Intel CnC. Our experiments on a 48-core
machine demonstrate that S-Net manages to outperform CnC on this problem. Comment: 9 pages, 8 figures, 1 table, accepted for PLC 2014 workshop
Programming Models' Support for Heterogeneous Architecture
Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Heterogeneous systems equipped with accelerators such as GPUs have become the most prominent components of High Performance Computing (HPC) systems. Even at the node level, the significant heterogeneity of CPU and GPU, i.e. hardware and memory space differences, leads to challenges in fully exploiting such complex architectures. Extending beyond the node scope only escalates these challenges.
Conventional programming models such as data-flow and message passing have been widely adopted in HPC communities. When moving towards heterogeneous systems, the lack of GPU integration causes such programming models to struggle in handling the heterogeneity of different computing units, leading to sub-optimal performance and a drastic decrease in developer productivity. To bridge the gap between underlying heterogeneous architectures and current programming paradigms, we propose to extend these paradigms with architecture-aware optimizations.
Two programming models are used to demonstrate the impact of heterogeneous architecture awareness. The PaRSEC task-based runtime, an adopter of the data-flow model, provides opportunities for overlapping communication with computation and minimizing data movement, as well as for dynamically adapting the work granularity to the capability of the hardware.
To meet the demand for an efficient and portable Message Passing Interface (MPI) implementation that can communicate GPU data, a GPU-aware design is presented based on the Open MPI infrastructure. It supports efficient point-to-point and collective communication of GPU-resident data, for both contiguous and non-contiguous memory layouts, by leveraging GPU network topology and hardware capabilities such as GPUDirect. The tight integration of GPU support in a widely used programming environment frees developers from manually moving data into and out of host memory around MPI communication calls, allowing them to focus instead on algorithmic optimizations.
Experimental results confirm that, supported by such a tight and transparent integration, conventional programming models can once again take advantage of state-of-the-art hardware and exhibit performance at the levels expected from the underlying hardware capabilities.