
    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, giving them the opportunity to process and optimize its traversal in order to maximize the algorithm's efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler and the PaRSEC and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve results comparable to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.
    Comment: Heterogeneity in Computing Workshop (2014)
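    The central idea of the abstract, handing the factorization's task graph to a generic runtime that chooses the traversal, can be illustrated with a minimal sketch (plain Python standing in for PaRSEC/StarPU; all task names are hypothetical, not from the paper):

```python
from collections import deque

def run_task_graph(tasks, deps):
    """Execute the tasks of a DAG in a valid topological order.

    tasks: dict mapping task name -> callable
    deps:  dict mapping task name -> set of names that must run first
    Returns the order in which tasks were executed.
    """
    indegree = {t: len(deps.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for t, ds in deps.items():
        for d in ds:
            children[d].append(t)
    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("cycle in task graph")
    return order

# Toy "factorization" DAG: a panel factorization unlocks its updates,
# one of which unlocks the next panel.
log = []
tasks = {name: (lambda n=name: log.append(n))
         for name in ["fact1", "upd12", "upd13", "fact2"]}
deps = {"upd12": {"fact1"}, "upd13": {"fact1"}, "fact2": {"upd12"}}
order = run_task_graph(tasks, deps)
```

    Any topological order is valid, which is exactly the freedom a runtime exploits to reorder work for the target hardware.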

    An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

    We present a sparse linear system solver based on a multifrontal variant of Gaussian elimination that exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
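    The HSS format rests on a recursive bisection of the index set: each internal node of the resulting binary cluster tree defines a pair of off-diagonal blocks that the solver approximates with low rank. A minimal sketch of that tree (the compression itself, randomized sampling plus interpolative decomposition, is omitted; this is not STRUMPACK's actual data structure):

```python
def hss_tree(lo, hi, leaf_size):
    """Recursively bisect the index range [lo, hi) into the binary
    cluster tree underlying an HSS partition.  Leaves are stored as
    dense diagonal blocks; each internal node's two children define
    one pair of off-diagonal blocks compressed to low rank."""
    if hi - lo <= leaf_size:
        return (lo, hi)  # leaf node
    mid = (lo + hi) // 2
    return (lo, hi, hss_tree(lo, mid, leaf_size), hss_tree(mid, hi, leaf_size))

def leaves(node):
    """Collect the leaf index ranges, left to right."""
    if len(node) == 2:
        return [node]
    return leaves(node[2]) + leaves(node[3])

tree = hss_tree(0, 8, leaf_size=2)
```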

    Geometry-Oblivious FMM for Compressing Dense SPD Matrices

    We present GOFMM (geometry-oblivious FMM), a novel method that creates a hierarchical low-rank approximation, or "compression," of an arbitrary dense symmetric positive definite (SPD) matrix. For many applications, GOFMM enables an approximate matrix-vector multiplication in N log N or even N time, where N is the matrix size. Compression requires N log N storage and work. In general, our scheme belongs to the family of hierarchical matrix approximation methods. In particular, it generalizes the fast multipole method (FMM) to a purely algebraic setting by only requiring the ability to sample matrix entries. Neither geometric information (i.e., point coordinates) nor knowledge of how the matrix entries have been generated is required, hence the term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme for hierarchical matrix computations that reduces synchronization barriers. We present results on the Intel Knights Landing and Haswell architectures, and on the NVIDIA Pascal architecture, for a variety of matrices.
    Comment: 13 pages, accepted by SC'17
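    The "geometry-oblivious" part can be made concrete: since only matrix entries are sampled, a distance between two indices can be derived from the SPD matrix itself, because for a Gram matrix K[i][j] = <x_i, x_j> the quantity K_ii + K_jj - 2 K_ij equals ||x_i - x_j||^2. A small sketch of that Gram distance (a simplified illustration, not GOFMM's implementation):

```python
def gram_distance_sq(K, i, j):
    """Squared 'distance' between indices i and j of an SPD matrix K,
    computed from matrix entries alone:
        d(i, j)^2 = K[i][i] + K[j][j] - 2 * K[i][j]
    For a Gram matrix this is the squared Euclidean distance between
    the underlying points, so clustering needs no point coordinates."""
    return K[i][i] + K[j][j] - 2 * K[i][j]

# Gram matrix of the 1-D points 0, 1, 3 (entries K[i][j] = x_i * x_j).
pts = [0.0, 1.0, 3.0]
K = [[a * b for b in pts] for a in pts]
```

    Clustering indices on this induced distance is what lets an FMM-style hierarchy be built for matrices with no known geometry.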

    Exploiting a Parametrized Task Graph model for the parallelization of a sparse direct multifrontal solver

    The advent of multicore processors requires reconsidering the design of high performance computing libraries to embrace portable and effective techniques of parallel software engineering. One of the most promising approaches consists in abstracting an application as a directed acyclic graph (DAG) of tasks. While this approach has been popularized for shared memory environments by the OpenMP 4.0 standard, where dependencies between tasks are automatically inferred, we investigate an alternative approach, capable of describing the DAG of tasks in a distributed setting, where task dependencies are explicitly encoded. So far this approach has mostly been used for algorithms with a regular data access pattern, and we show in this study that it can be efficiently applied to a highly irregular numerical algorithm such as a sparse multifrontal QR method. We present the resulting implementation and discuss the potential and limits of this approach in terms of productivity and effectiveness in comparison with more common parallelization techniques. Although at an early stage of development, preliminary results show the potential of the parallel programming model that we investigate in this work.
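    The distinguishing feature of a parametrized task graph is that dependencies are encoded explicitly as functions of a task's parameters, so no global DAG is ever materialized. A hypothetical sketch for a panel/update factorization pattern (task names and rules are illustrative, not the paper's actual PTG):

```python
def ptg_deps(task):
    """Predecessors of a task, computed from its parameters alone:
    each task derives its dependencies symbolically instead of
    discovering them by unrolling a sequential program."""
    if task[0] == "panel":
        k = task[1]
        # panel k consumes the update that panel k-1 applied to column k
        return [("update", k - 1, k)] if k > 0 else []
    _, i, j = task  # ("update", i, j)
    # the update of column j by panel i consumes panel i's output
    return [("panel", i)]
```

    Because the rule is symbolic, a distributed runtime can evaluate it locally on any node without broadcasting the whole graph, which is what makes the model attractive beyond shared memory.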

    Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems

    To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism have regained popularity in the high performance scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely sparse matrix multifrontal factorizations, which feature extremely irregular workloads with tasks of different granularities and characteristics and a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system.
    ACM Reference Format: Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez, 2014. Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems.
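    In the Sequential Task Flow model, the programmer submits tasks in program order with data access modes, and the runtime infers the DAG edges (read-after-write, write-after-read, write-after-write). A minimal stand-in for that inference, not StarPU's actual API, with hypothetical task and data names:

```python
def stf_infer_deps(submissions):
    """Infer DAG edges from a sequential stream of task submissions.

    submissions: list of (task_name, accesses), where accesses is a
    list of (data_name, mode) with mode in {"R", "W"}.
    Returns the set of (predecessor, successor) edges implied by
    RAW, WAR, and WAW hazards.
    """
    last_writer = {}    # data -> task that last wrote it
    readers_since = {}  # data -> tasks that read it since its last write
    edges = set()
    for name, accesses in submissions:
        for data, mode in accesses:
            if mode == "R":
                if data in last_writer:
                    edges.add((last_writer[data], name))   # RAW
                readers_since.setdefault(data, []).append(name)
            else:  # mode == "W"
                for r in readers_since.get(data, []):
                    edges.add((r, name))                   # WAR
                if data in last_writer:
                    edges.add((last_writer[data], name))   # WAW
                last_writer[data] = name
                readers_since[data] = []
    return edges

edges = stf_infer_deps([
    ("factor_A", [("A", "W")]),
    ("update_B", [("A", "R"), ("B", "W")]),
    ("factor_B", [("B", "W")]),
])
```

    The appeal for an irregular code like a multifrontal solver is that the submission loop stays sequential and readable while the runtime extracts all the parallelism the access modes allow.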

    Task-based hybrid linear solver for distributed memory heterogeneous architectures

    Heterogeneity is emerging as one of the most challenging characteristics of today's parallel environments. However, not many fully-featured, advanced numerical and scientific libraries have been ported to such architectures. In this paper, we propose to extend a sparse hybrid solver for handling distributed memory heterogeneous platforms. As in the original solver, we perform a domain decomposition and associate one subdomain with one MPI process. However, while each subdomain was processed sequentially (bound to a single CPU core) in the original solver, the new solver instead relies on task-based local solvers, delegating tasks to the available computing units. We show that this "MPI+task" design conveniently allows for exploiting distributed memory heterogeneous machines. Indeed, a subdomain can now be processed on multiple CPU cores (such as a whole multicore processor or a subset of the available cores), possibly enhanced with GPUs. We illustrate our discussion with the MaPHyS sparse hybrid solver relying on the PaStiX and Chameleon sparse direct and dense linear algebra libraries, respectively. Interestingly, this two-level MPI+task design furthermore provides extra flexibility for controlling the number of subdomains, enhancing the numerical stability of the considered hybrid method. While the rise of heterogeneous computing has been strongly driven by the theoretical community, this study aims at showing that it is now also possible to build complex software layers on top of runtime systems to exploit heterogeneous architectures.
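    The two-level "MPI+task" structure, one process per subdomain, each processing its local work as tasks on several cores, can be caricatured in a few lines (a loose analogy using a thread pool per subdomain in place of an MPI rank running a task-based local solver; all names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def solve_subdomain(blocks, n_cores):
    """Process one subdomain's local block tasks on a pool of cores,
    standing in for the task-based local solver inside one MPI rank.
    Here each 'task' is just summing a block of numbers."""
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        partials = list(pool.map(sum, blocks))
    return sum(partials)

# Two "MPI ranks", each owning one subdomain made of block data;
# in the real solver these would run on separate nodes.
subdomains = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
results = [solve_subdomain(sd, n_cores=2) for sd in subdomains]
```

    The point of the design is that the outer level (number of subdomains) can now be tuned for numerics rather than being forced to match the core count.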

    Task-based multifrontal QR solver for heterogeneous architectures

    To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism have regained popularity in the high performance scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design of task-based sparse direct solvers, which constitute extremely irregular workloads with tasks of different granularities and characteristics and variable memory consumption, on top of runtime systems. In the context of the qr_mumps solver, we prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on the Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding factorization kernels in the multifrontal method, considerably improving the scalability of the solver compared to the original approach used in qr_mumps. In addition, we introduced a memory-aware scheduling algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint of the multifrontal QR factorization with a negligible impact on performance. Following this approach, we target heterogeneous architectures, for which task granularity and scheduling strategies are critical to achieving performance. We present, for the multifrontal method, a hierarchical data partitioning strategy and a scheduling algorithm capable of exploiting the heterogeneity of resources. Finally, we present a study on the reproducibility of parallel executions and the use of an alternative programming model for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with the detailed performance analysis introduced at the beginning of the study, which measures the impact of several identified effects on performance and scalability and allows us to fully understand the results obtained with our solver.
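    The memory-aware scheduling idea mentioned in the abstract, activating ready tasks only while they fit under a memory budget and serializing the rest, can be sketched greedily (an illustrative simplification with hypothetical names, not the thesis's actual algorithm):

```python
def memory_aware_schedule(tasks, budget):
    """Greedy sketch of memory-constrained scheduling.

    tasks:  list of (name, mem) pairs, all ready to run
    budget: total memory available per execution 'wave'
    At each step, run every pending task that still fits under the
    remaining budget, then reclaim all memory and start a new wave.
    Returns the list of waves (each a list of task names).
    """
    pending = list(tasks)
    waves = []
    while pending:
        free = budget
        wave, rest = [], []
        for name, mem in pending:
            if mem <= free:
                free -= mem
                wave.append(name)
            else:
                rest.append((name, mem))
        if not wave:
            raise ValueError("a single task exceeds the whole budget")
        waves.append(wave)
        pending = rest
    return waves

waves = memory_aware_schedule([("f1", 4), ("f2", 3), ("f3", 5)], budget=8)
```

    Trading a little concurrency (extra waves) for a bounded peak footprint is exactly the compromise the abstract reports as having a negligible performance impact.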
