    Partitioning, Ordering, and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver

    Institut National Polytechnique de Toulouse, RT-APO-12-2. PDSLin is a general-purpose algebraic parallel hybrid (direct/iterative) linear solver based on the Schur complement method. The most challenging step of the solver is the computation of a preconditioner based on an approximate global Schur complement. We investigate two combinatorial problems to enhance PDSLin's performance at this step. The first is a multi-constraint partitioning problem to balance the workload while computing the preconditioner in parallel. For this, we describe and evaluate a number of graph and hypergraph partitioning algorithms that satisfy our particular objective and constraints. The second problem is to reorder the sparse right-hand-side vectors to improve data access locality during the parallel solution of a sparse triangular system with multiple right-hand sides, which speeds up the elimination of the unknowns associated with the interface. We study two reordering techniques: one based on a postordering of the elimination tree and the other based on hypergraph partitioning. To demonstrate the effect of these techniques on the performance of PDSLin, we present numerical results for large-scale linear systems arising from two applications of interest: numerical simulations of accelerator cavities and of fusion devices.
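For reference, the Schur complement elimination that PDSLin builds on can be summarized with a small dense sketch: partition the system into interior and interface unknowns, eliminate the interior block, and solve the reduced interface system. This is a minimal illustration with invented sizes and dense NumPy arrays; PDSLin itself works on sparse matrices in parallel and only approximates the global Schur complement when forming its preconditioner.

```python
# Minimal dense sketch of Schur complement elimination (illustrative only;
# PDSLin is sparse and parallel, and approximates S for its preconditioner).
import numpy as np

rng = np.random.default_rng(0)
n_i, n_g = 6, 3                              # interior / interface unknowns
n = n_i + n_g
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

A11, A12 = A[:n_i, :n_i], A[:n_i, n_i:]
A21, A22 = A[n_i:, :n_i], A[n_i:, n_i:]

# Schur complement of A11 in A: S = A22 - A21 * inv(A11) * A12.
S = A22 - A21 @ np.linalg.solve(A11, A12)

# Eliminate the interior, solve the interface system, then back-substitute.
x_g = np.linalg.solve(S, b[n_i:] - A21 @ np.linalg.solve(A11, b[:n_i]))
x_i = np.linalg.solve(A11, b[:n_i] - A12 @ x_g)

assert np.allclose(A @ np.concatenate([x_i, x_g]), b)
```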

    An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

    We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7x for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared-memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed-memory component for dense rank-structured matrices.
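The randomized sampling at the heart of this kind of HSS construction can be illustrated in isolation: the range of a numerically low-rank block is captured from a few random matrix-vector products, after which the block is compressed against that basis. This is a hedged sketch of the generic range-finder only, with invented sizes; the actual HSS construction is hierarchical and uses interpolative decompositions rather than the plain QR-based projection below.

```python
# Hedged sketch of a randomized range-finder for a low-rank block (invented
# sizes); HSS construction proper is hierarchical and uses interpolative
# decompositions on top of such samples.
import numpy as np

rng = np.random.default_rng(1)
m, n, true_rank, k = 200, 150, 12, 20        # sampling size k > true rank

B = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))

Omega = rng.standard_normal((n, k))          # random test matrix
Y = B @ Omega                                # needs only matvecs with B
Q, _ = np.linalg.qr(Y)                       # orthonormal basis for range(B)

B_approx = Q @ (Q.T @ B)                     # rank-k approximation of B
print(np.linalg.norm(B - B_approx) / np.linalg.norm(B))   # ~ machine epsilon
```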

    Hypergraph-based Unsymmetric Nested Dissection Ordering for Sparse LU Factorization

    In this paper we present HUND, a hypergraph-based unsymmetric nested dissection ordering algorithm for reducing the fill-in incurred during Gaussian elimination. HUND has several important properties. It takes a global perspective of the entire matrix, as opposed to local heuristics. It takes into account the asymmetry of the input matrix by using a hypergraph to represent its structure. It is suitable for performing Gaussian elimination in parallel, with partial pivoting: the row permutations performed due to partial pivoting do not destroy the column separators identified by the nested dissection approach. Experimental results on 27 medium- and large-size highly unsymmetric matrices compare HUND to four other well-known reordering algorithms. The results show that HUND provides a robust reordering algorithm, in the sense that it is the best or close to the best (often within 10%) of all the other methods.
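The column-net hypergraph model that such unsymmetric orderings rely on is easy to state concretely: every row of the matrix becomes a vertex and every column becomes a net connecting the rows with a nonzero in that column, so nets cut by a vertex partition yield column separators. The helper below is a hypothetical illustration of building that model with SciPy, not HUND's actual code; HUND delegates the partitioning itself to a hypergraph partitioner, which is not shown here.

```python
# Hypothetical illustration of the column-net hypergraph model: rows are
# vertices, each column is a net connecting the rows with a nonzero in it.
# A partitioner such as PaToH or hMETIS would then split the vertices, and
# cut nets give the column separator; that step is not shown here.
import scipy.sparse as sp

def column_net_hypergraph(A) -> dict:
    """Return the hypergraph as {column j: rows i with A[i, j] != 0}."""
    A = sp.csc_matrix(A)
    return {j: A.indices[A.indptr[j]:A.indptr[j + 1]].tolist()
            for j in range(A.shape[1])}

A = sp.csc_matrix([[1, 0, 2],
                   [0, 3, 4],
                   [5, 0, 6]])
print(column_net_hypergraph(A))              # {0: [0, 2], 1: [1], 2: [0, 1, 2]}
```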

    ProblÚmes de mémoire et de performance de la factorisation multifrontale parallÚle et de la résolution triangulaire à seconds membres creux

    We consider the solution of very large sparse systems of linear equations on parallel architectures. In this context, memory is often a bottleneck that prevents or limits the use of direct solvers, especially those based on the multifrontal method. This work focuses on the memory and performance issues of the two most memory- and computation-intensive phases of direct methods, namely the numerical factorization and the solution phase. In the first part we consider the solution phase with sparse right-hand sides, and in the second part we consider the memory scalability of the multifrontal factorization. In the first part, we focus on the triangular solution phase with multiple sparse right-hand sides, which appear in numerous applications. We especially emphasize the computation of entries of the inverse, where both the right-hand sides and the solution are sparse. We first present several storage schemes that enable a significant compression of the solution space, both in a sequential and a parallel context. We then show that the way the right-hand sides are partitioned into blocks strongly influences the performance, and we consider two different settings: the out-of-core case, where the aim is to reduce the number of accesses to the factors, which are stored on disk, and the in-core case, where the aim is to reduce the computational cost. Finally, we show how to enhance the parallel efficiency. In the second part, we consider the parallel multifrontal factorization. We show that controlling the active memory specific to the multifrontal method is critical, and that commonly used mapping techniques usually fail to do so: they cannot achieve high memory scalability, i.e., they dramatically increase the amount of memory needed by the factorization when the number of processors increases. We propose a class of "memory-aware" mapping and scheduling algorithms that aim at maximizing performance while enforcing a user-given memory constraint, and that provide robust memory estimates before the factorization. These techniques revealed performance issues in the parallel dense kernels used at each step of the factorization, for which we have proposed algorithmic improvements. The ideas presented throughout this study have been implemented within the MUMPS (MUltifrontal Massively Parallel Solver) solver and experimented on large matrices (up to a few tens of millions of unknowns) and massively parallel architectures (up to a few thousand cores). They have been shown to improve the performance and the robustness of the code, and will be available in a future release. Some of the ideas presented in the first part have also been implemented within the PDSLin (Parallel Domain decomposition Schur complement based Linear solver) package.
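The computation of inverse entries with sparse right-hand sides described above reduces, entry by entry, to triangular solves: entry (i, j) of the inverse is component i of the solution of A x = e_j. The sketch below shows only this reduction using SciPy's sparse LU, with an invented test matrix; a multifrontal solver as studied in the thesis would additionally prune the elimination tree so that only the parts of the factors touched by the sparse right-hand side and the requested entry are accessed.

```python
# Sketch of computing a selected inverse entry via a sparse right-hand side:
# entry (i, j) of inv(A) is component i of the solution of A x = e_j. SciPy's
# solve does not prune the elimination tree the way the solver studied here does.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 50
A = (sp.random(n, n, density=0.1, random_state=2) + n * sp.eye(n)).tocsc()
lu = spla.splu(A)                            # sparse LU, factorized once

def inverse_entry(i: int, j: int) -> float:
    e_j = np.zeros(n)
    e_j[j] = 1.0                             # sparse right-hand side: one nonzero
    return lu.solve(e_j)[i]                  # component i of inv(A) @ e_j

print(inverse_entry(3, 7), np.linalg.inv(A.toarray())[3, 7])   # should agree
```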

    A Graph Approach to Observability in Physical Sparse Linear Systems

    A sparse linear system constitutes a valid model for a broad range of physical systems, such as electric power networks, industrial processes, control systems, or traffic models. The physical magnitudes in those systems may be directly measured by means of sensor networks that, in conjunction with data obtained from contextual and boundary constraints, allow the estimation of the state of the systems. The term observability refers to the capability of estimating the state variables of a system based on the available information. In the case of linear systems, different graphical approaches have been developed to address this issue. In this paper, a new unified graph-based technique is proposed to determine, for a given sensor set, the observability of a sparse linear physical system, or at least of a system that can be linearized to first order. A network associated with a linear equation system is introduced, which allows addressing and solving three related problems: the characterization of those cases for which algebraic and topological observability analyses return contradictory results; the characterization of a necessary and sufficient condition for topological observability; and the determination of the maximum observable subsystem in case of unobservability. Two examples illustrate the developed techniques.
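A common concrete form of the topological observability test mentioned above is a maximum matching between measurements and state variables in the bipartite graph of the measurement model. The sketch below, with an invented measurement matrix, checks structural observability this way; it is not the paper's unified technique, and topological analysis of this kind can disagree with the algebraic (numerical) analysis for degenerate parameter values, which is one of the discrepancies the paper characterizes.

```python
# Hedged sketch of a topological observability test as maximum bipartite
# matching (invented measurement matrix H; H[i, j] != 0 iff measurement i
# involves state j). With a square H, the system is structurally observable
# iff a perfect matching exists.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import maximum_bipartite_matching

H = sp.csr_matrix([[1, 1, 0],                # 3 measurements, 3 state variables
                   [0, 1, 1],
                   [1, 0, 0]])

match = maximum_bipartite_matching(H, perm_type="column")   # -1 marks unmatched
print("topologically observable:", bool(np.all(match >= 0)))
```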

    High-performance direct solution of finite element problems on multi-core processors

    A direct solution procedure is proposed and developed which exploits the parallelism that exists in current symmetric multiprocessing (SMP) multi-core processors. Several algorithms are proposed and developed to improve the performance of the direct solution of FE problems. A high-performance sparse direct solver is developed which allows experimentation with the newly developed and existing algorithms. The performance of the algorithms is investigated using a large set of FE problems. Furthermore, operation count estimations are developed to further assess the various algorithms. An out-of-core version of the solver is developed to reduce the memory requirements of the solution. I/O is performed asynchronously, without blocking the thread that makes the I/O request. Asynchronous I/O allows overlapping the factorization and triangular solution computations with I/O. The performance of the developed solver is demonstrated on a large number of test problems. A problem with nearly 10 million degrees of freedom is solved on a low-priced desktop computer using the out-of-core version of the direct solver. Furthermore, the developed solver usually outperforms a commonly used shared-memory solver.
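The overlap of computation with asynchronous I/O can be sketched generically: a dedicated I/O thread writes the factor block of step k to disk while the compute thread proceeds with step k+1. The Python toy below, with invented block sizes and file names, shows only the pattern; the actual solver manages out-of-core buffers and request ordering far more carefully.

```python
# Toy sketch of overlapping factorization with asynchronous I/O: a dedicated
# I/O thread writes block k while the compute thread factors block k+1.
# Block contents and file names are invented for the illustration.
import os, tempfile
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def factor_block(k: int) -> np.ndarray:
    a = np.random.default_rng(k).standard_normal((500, 500))
    return np.linalg.cholesky(a @ a.T + 500 * np.eye(500))   # dense kernel

with ThreadPoolExecutor(max_workers=1) as io_pool:           # the I/O thread
    pending = None
    for k in range(8):
        L_k = factor_block(k)                # compute while previous I/O runs
        if pending is not None:
            pending.result()                 # only blocks if I/O is lagging
        path = os.path.join(tempfile.gettempdir(), f"factor_{k}.npy")
        pending = io_pool.submit(np.save, path, L_k)
    pending.result()                         # flush the last write
```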

    High performance Cholesky and symmetric indefinite factorizations with applications

    Factorizing a symmetric matrix $A$ using the Cholesky ($LL^T$) or symmetric indefinite ($LDL^T$) factorization allows the efficient solution of systems $Ax = b$. This thesis describes the development of new serial and parallel techniques for this problem and demonstrates them in the setting of interior point methods. In serial, the effects of various scalings are reported, and a fast and robust mixed-precision sparse solver is developed. In parallel, DAG-driven dense and sparse factorizations are developed for the positive definite case. These achieve performance comparable with other world-leading implementations, using a novel algorithm in the same family as those given by Buttari et al. for the dense problem. The performance of these techniques in the context of an interior point method is assessed.
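A DAG-driven dense Cholesky factorization of the kind referred to above is built from four tile kernels (factor a diagonal tile, triangular solves below it, and symmetric/general updates) whose data dependencies form the task DAG handed to a runtime scheduler. The sketch below runs those tasks sequentially in one valid DAG order, on invented data; it illustrates the task decomposition, not the thesis's actual scheduler or algorithm.

```python
# Sequential sketch of a tiled Cholesky factorization; the POTRF/TRSM/
# SYRK-GEMM tasks below are the nodes of the DAG that a runtime scheduler
# would execute in parallel. Data and tile size are invented.
import numpy as np

def tiled_cholesky(A: np.ndarray, nb: int) -> np.ndarray:
    L = np.tril(A.copy())                    # work on the lower triangle
    n_t = A.shape[0] // nb                   # number of tile rows/columns
    T = lambda i, j: np.s_[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
    for k in range(n_t):
        L[T(k, k)] = np.linalg.cholesky(L[T(k, k)])                    # POTRF
        for i in range(k + 1, n_t):
            L[T(i, k)] = np.linalg.solve(L[T(k, k)], L[T(i, k)].T).T   # TRSM
        for i in range(k + 1, n_t):
            for j in range(k + 1, i + 1):
                L[T(i, j)] -= L[T(i, k)] @ L[T(j, k)].T            # SYRK/GEMM
    return np.tril(L)

A = np.random.default_rng(3).standard_normal((8, 8))
A = A @ A.T + 8 * np.eye(8)                  # symmetric positive definite
L = tiled_cholesky(A, nb=2)
assert np.allclose(L @ L.T, A)
```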

    Improving multifrontal solvers by means of algebraic Block Low-Rank representations

    We consider the solution of large sparse linear systems by means of direct factorization based on a multifrontal approach. Although numerically robust and easy to use (it only needs algebraic information: the input matrix A and a right-hand side b, though it can also exploit preprocessing strategies based on geometric information), direct factorization methods are computationally intensive both in terms of memory and operations, which limits their scope on very large problems (matrices with up to a few hundred million equations). This work focuses on exploiting low-rank approximations in multifrontal-based direct methods to reduce both the memory footprint and the operation count, in sequential and distributed-memory environments, on a wide class of problems. We first survey the low-rank formats which have been previously developed to efficiently represent dense matrices and have been widely used to design fast solutions of partial differential equations, integral equations, and eigenvalue problems. These formats are hierarchical ($\mathcal{H}$ and Hierarchically SemiSeparable matrices are the most common ones) and have been shown, both theoretically and practically, to substantially decrease the memory and operation requirements of linear algebra computations. However, they impose many structural constraints which can limit their scope and efficiency, especially in the context of general-purpose multifrontal solvers. We propose a flat format called Block Low-Rank (BLR), based on a natural blocking of the matrices, and explain why it provides all the flexibility needed by a general-purpose multifrontal solver in terms of numerical pivoting for stability and of parallelism. We compare the BLR format with other formats and show that BLR does not significantly compromise the memory and operation improvements achieved through low-rank approximations. A stability study shows that the approximations are well controlled by an explicit numerical parameter called the low-rank threshold, which is critical in order to solve the sparse linear system accurately. Details on how Block Low-Rank factorizations can be efficiently implemented within multifrontal solvers are then given. We propose several Block Low-Rank factorization algorithms which allow for different types of gains. The proposed algorithms have been implemented within the MUMPS (MUltifrontal Massively Parallel Solver) solver. We first report experiments on standard partial differential equation based problems to analyse the main features of our BLR algorithms and to show the potential and flexibility of the approach; a comparison with a Hierarchically SemiSeparable code is also given. Block Low-Rank formats are then tested on large (up to a hundred million unknowns) and varied problems coming from several industrial applications. We finally illustrate the use of our approach as a preconditioner for the Conjugate Gradient method.
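The flat BLR representation can be sketched in a few lines: tile the matrix, keep diagonal tiles dense, and replace each off-diagonal tile by a truncated SVD whenever that representation is smaller at the given low-rank threshold. This is an illustrative compression of a standalone dense matrix with invented parameters; a BLR multifrontal solver compresses the frontal matrices during factorization and uses cheaper compression kernels than the full SVD below.

```python
# Illustrative BLR compression of a standalone dense matrix: dense diagonal
# tiles, truncated-SVD off-diagonal tiles when that is smaller at the given
# threshold. Kernel, sizes, and threshold are invented for the example.
import numpy as np

def blr_compress(A: np.ndarray, nb: int, threshold: float) -> dict:
    """Tile A into nb x nb blocks; return {(i, j): dense tile or (U, V)}."""
    n_t = A.shape[0] // nb
    tiles = {}
    for i in range(n_t):
        for j in range(n_t):
            B = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
            if i == j:
                tiles[i, j] = B              # diagonal tiles stay dense
                continue
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            r = int(np.sum(s > threshold * s[0]))    # rank at the threshold
            if 2 * r * nb < nb * nb:         # low-rank form only if smaller
                tiles[i, j] = (U[:, :r] * s[:r], Vt[:r])   # B ~ U @ V
            else:
                tiles[i, j] = B
    return tiles

x = np.arange(256)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))  # low-rank off-diagonal tiles
tiles = blr_compress(A, nb=64, threshold=1e-8)
print(sum(isinstance(t, tuple) for t in tiles.values()), "of", len(tiles),
      "tiles stored in low-rank form")
```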