131 research outputs found

    A study of various load information exchange mechanisms for a distributed application using dynamic scheduling

    We consider a distributed asynchronous system where processes can only communicate by message passing and need a coherent view of the load (e.g., pending work, memory usage) of the other processes to take dynamic scheduling decisions. We present several mechanisms to obtain a distributed view of such information: in the first type of approach, the view is maintained through regular message exchanges; in the second (demand-driven, snapshot-style), the requesting process issues a query and then receives the load information corresponding to its request. We perform an experimental study of these mechanisms in the context of a real application, an asynchronous parallel solver for large sparse systems of linear equations.
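
    As a rough illustration of the first family of mechanisms, the sketch below (plain C with MPI, not the paper's code; the LOAD_TAG value and the scalar notion of "load" are assumptions made for illustration) maintains a distributed load view: each process periodically publishes its own load to the others and opportunistically drains incoming load messages to refresh its local view.

        #include <mpi.h>
        #include <stdlib.h>

        #define LOAD_TAG 42                     /* hypothetical message tag */

        /* Publish our current load estimate to every other process. */
        static void publish_load(double my_load, MPI_Comm comm)
        {
            int rank, size, nreq = 0;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            MPI_Request *reqs = malloc((size_t)size * sizeof *reqs);
            for (int p = 0; p < size; p++) {
                if (p == rank) continue;
                MPI_Isend(&my_load, 1, MPI_DOUBLE, p, LOAD_TAG, comm,
                          &reqs[nreq++]);
            }
            MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE); /* small, eager sends */
            free(reqs);
        }

        /* Drain pending load messages, refreshing our view of the other ranks. */
        static void refresh_view(double *view, MPI_Comm comm)
        {
            int flag;
            MPI_Status st;
            for (;;) {
                MPI_Iprobe(MPI_ANY_SOURCE, LOAD_TAG, comm, &flag, &st);
                if (!flag) break;
                double load;
                MPI_Recv(&load, 1, MPI_DOUBLE, st.MPI_SOURCE, LOAD_TAG, comm,
                         MPI_STATUS_IGNORE);
                view[st.MPI_SOURCE] = load;     /* possibly slightly stale */
            }
        }

    A demand-driven mechanism would instead send a request and wait for the replies it triggered, trading the background message traffic for latency at decision time.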

    Some Experiments and Issues to Exploit Multicore Parallelism in a Distributed-Memory Parallel Sparse Direct Solver

    MUMPS is a parallel sparse direct solver that uses message passing (MPI) for parallelism. In this report we investigate how thread parallelism can help take advantage of recent multicore architectures. The work consists of testing multithreaded BLAS libraries and inserting OpenMP directives into the routines that profiling revealed to be costly, with the objective of avoiding any deep restructuring or rewriting of the code. We report on various aspects of this work, present some of the benefits and difficulties, and show that 4 threads per MPI process is generally a good compromise. We then discuss various issues that appear to be critical in a mixed MPI-OpenMP environment.
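
    As an illustration of this style of incremental parallelization, the sketch below (an assumption for illustration, not actual MUMPS code) shows the kind of OpenMP directive one might insert into a routine that profiling exposes as costly, here the scatter-add assembly of a child contribution block into its parent front; the dense kernels themselves would run on a multithreaded BLAS.

        #include <omp.h>

        /* Scatter-add an nc-by-nc contribution block `cb` into the parent
         * front `front` (row-major, leading dimension ldf) at the rows and
         * columns listed in `map` (entries assumed distinct). */
        void assemble_contribution(double *front, int ldf,
                                   const double *cb, int nc, const int *map)
        {
            /* Distinct rows of the front are written by distinct iterations,
             * so distributing the outer loop over threads is race-free. */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < nc; i++) {
                const int fi = map[i];
                for (int j = 0; j < nc; j++)
                    front[fi * ldf + map[j]] += cb[i * nc + j];
            }
        }

    With this approach one would typically run with something like OMP_NUM_THREADS=4 per MPI process, the compromise the report arrives at.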

    Introduction of shared-memory parallelism in a distributed-memory multifrontal solver

    We study the adaptation of a parallel distributed-memory solver into a shared-memory code targeting multicore architectures. The advantage of adapting the code over designing a new one is to fully benefit from its numerical kernels, range of functionalities and internal features. Although the studied code is a direct solver for sparse systems of linear equations, the approaches described in this paper are general and could be useful to a wide range of applications. We show how existing parallel algorithms can be adapted to an OpenMP environment while, at the same time, also relying on third-party optimized multithreaded libraries. We propose simple approaches to take advantage of NUMA architectures, and original optimizations to limit thread synchronization costs in the fork-join model we use. For each point, the performance gains are analyzed in detail on test problems from various application areas.
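
    One simple NUMA-oriented technique consistent with this theme is the first-touch idiom sketched below (a generic illustration, not the solver's actual code): memory is initialized in parallel with the same thread layout the later kernels use, so pages land on the memory node of the thread that will work on them.

        #include <omp.h>
        #include <stdlib.h>

        /* Allocate and zero a factor area of n doubles.  With first-touch
         * page placement, page p is mapped near the thread that writes it
         * first, which here matches the static schedule of later kernels. */
        double *alloc_factors_numa(size_t n)
        {
            double *a = malloc(n * sizeof *a);
            if (!a) return NULL;
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < (long)n; i++)
                a[i] = 0.0;
            return a;
        }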

    Scilab and MATLAB Interfaces to MUMPS (version 4.6 or greater)

    This document describes the Scilab and MATLAB interfaces to MUMPS version 4.6. We describe the differences and similarities between the usual Fortran/C MUMPS interfaces and the Scilab/MATLAB interfaces, as well as the calling sequences and functionalities. Examples of use and experimental results are also provided.
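
    The Scilab/MATLAB interfaces mirror the underlying Fortran/C calling sequence. For reference, a minimal sketch of that C sequence is given below (field names follow the MUMPS 4.x examples, e.g. id.nz rather than the later id.nnz, and the USE_COMM_WORLD value is taken from those examples; consult the dmumps_c.h of your release): job = -1 initializes an instance, job = 6 chains analysis, factorization and solve, and job = -2 frees the instance.

        #include <mpi.h>
        #include "dmumps_c.h"

        #define USE_COMM_WORLD -987654  /* as in the MUMPS examples */

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            /* 2x2 system in coordinate format:  [2 0; 0 3] x = [4; 9] */
            int    irn[] = {1, 2};
            int    jcn[] = {1, 2};
            double a[]   = {2.0, 3.0};
            double rhs[] = {4.0, 9.0};

            DMUMPS_STRUC_C id;
            id.comm_fortran = USE_COMM_WORLD;
            id.par = 1;          /* host participates in computations */
            id.sym = 0;          /* unsymmetric matrix */
            id.job = -1;         /* initialize the instance */
            dmumps_c(&id);

            id.n = 2; id.nz = 2;
            id.irn = irn; id.jcn = jcn; id.a = a; id.rhs = rhs;

            id.job = 6;          /* analysis + factorization + solve */
            dmumps_c(&id);       /* on exit, rhs holds the solution */

            id.job = -2;         /* free the instance */
            dmumps_c(&id);
            MPI_Finalize();
            return 0;
        }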

Parallel computation of entries of A⁻¹

    In this paper, we are concerned with computing in parallel several entries of the inverse of a large sparse matrix. We assume that the matrix has already been factorized by a direct method and that the factors are distributed. Entries are computed efficiently by exploiting the sparsity of the right-hand sides and of the solution vectors in the triangular solution phase. We demonstrate that in this setting, parallelism and computational efficiency are two contrasting objectives. We develop an efficient approach and show its efficacy through runs using the MUMPS code, which implements a parallel multifrontal method.
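
    The idea can be stated on a single entry: the (i,j) entry of A⁻¹ is x_i where A x = e_j, so with A = LU the forward solve L y = e_j can skip every row above j (the right-hand side is zero there) and the backward solve U x = y can stop once row i is reached. The dense-storage sketch below (no pivoting and no sparse data structures, both simplifications made for brevity) shows exactly that pruning; the paper's setting additionally exploits the sparsity of the factors and their distribution over processes.

        /* Return the (i,j) entry of inv(A), given factors A = L*U with L
         * unit lower triangular and U upper triangular, both n-by-n and
         * row-major.  A real solver prunes with the elimination tree
         * instead of these simple loop bounds. */
        static double inverse_entry(const double *L, const double *U,
                                    int n, int i, int j)
        {
            double y[n], x[n];          /* C99 VLAs, fine for a sketch */

            /* Forward solve L y = e_j: rows above j stay zero because the
             * right-hand side is zero there, so work starts at row j. */
            for (int k = 0; k < j; k++) y[k] = 0.0;
            for (int k = j; k < n; k++) {
                double s = (k == j) ? 1.0 : 0.0;
                for (int m = j; m < k; m++) s -= L[k * n + m] * y[m];
                y[k] = s;               /* unit diagonal: no division */
            }

            /* Backward solve U x = y, stopping at row i: rows above i do
             * not influence x[i]. */
            for (int k = n - 1; k >= i; k--) {
                double s = y[k];
                for (int m = k + 1; m < n; m++) s -= U[k * n + m] * x[m];
                x[k] = s / U[k * n + k];
            }
            return x[i];                /* the (i,j) entry of inv(A) */
        }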

    Modeling 1D distributed-memory dense kernels for an asynchronous multifrontal sparse solver

    To solve sparse systems of linear equations, multifrontal methods rely on dense partial LU decompositions of so-called frontal matrices; we consider a parallel asynchronous setting in which several frontal matrices can be factored simultaneously. In this context, to address performance and scalability issues of acyclic pipelined asynchronous factorization kernels, we use models to revisit properties of left-looking and right-looking variants of partial LU decompositions and study the use of several levels of blocking, before focusing on communication issues. The general-purpose sparse solver MUMPS has been modified to implement the proposed algorithms and confirm the properties demonstrated by the models.
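
    For concreteness, the sketch below (plain loops and no pivoting, both simplifications made for illustration; a real kernel performs the panel with dgetrf-like code and the update with dtrsm and dgemm) shows the right-looking variant of a partial LU decomposition of a frontal matrix: only the first npiv variables are eliminated, panel by panel of width nb, and each panel immediately updates the trailing submatrix, whose last n - npiv rows and columns end up holding the contribution block passed to the parent front.

        /* Right-looking partial LU of an n-by-n frontal matrix F
         * (row-major), eliminating only its first npiv variables. */
        void partial_lu_right_looking(double *F, int n, int npiv, int nb)
        {
            for (int k0 = 0; k0 < npiv; k0 += nb) {
                int k1 = (k0 + nb < npiv) ? k0 + nb : npiv;

                /* Factor the current panel (columns k0..k1-1). */
                for (int k = k0; k < k1; k++)
                    for (int i = k + 1; i < n; i++) {
                        F[i * n + k] /= F[k * n + k];
                        for (int j = k + 1; j < k1; j++)
                            F[i * n + j] -= F[i * n + k] * F[k * n + j];
                    }

                /* Right-looking update of the trailing columns k1..n-1:
                 * the triangular solve on the panel rows and the rank-nb
                 * update of everything below, merged into one loop nest. */
                for (int k = k0; k < k1; k++)
                    for (int i = k + 1; i < n; i++)
                        for (int j = k1; j < n; j++)
                            F[i * n + j] -= F[i * n + k] * F[k * n + j];
            }
        }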

    Improving multifrontal methods by means of block low-rank representations

    Submitted for publication to SIAM. Matrices coming from elliptic Partial Differential Equations (PDEs) have been shown to have a low-rank property: well-defined off-diagonal blocks of their Schur complements can be approximated by low-rank products. Given a suitable ordering of the matrix, which gives the blocks a geometrical meaning, such approximations can be computed using an SVD or a rank-revealing QR factorization. The resulting representation offers a substantial reduction of the memory requirement and gives efficient ways to perform many of the basic dense linear algebra operations. Several strategies have been proposed to exploit this property. We propose a low-rank format called Block Low-Rank (BLR), and explain how it can be used to reduce the memory footprint and the complexity of direct solvers for sparse matrices based on the multifrontal method. We present experimental results which show that the BLR format delivers gains comparable to those obtained with hierarchical formats such as Hierarchical (H) matrices and Hierarchically Semi-Separable (HSS) matrices, but provides much greater flexibility and ease of use, which are essential in the context of a general-purpose, algebraic solver.
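
    A minimal sketch of the compression kernel is given below: an off-diagonal block B is replaced by a low-rank product X * Y^T obtained from a truncated SVD (the abstract also mentions rank-revealing QR as an alternative). LAPACKE_dgesvd is the standard LAPACKE routine; the absolute threshold tol and the memory management are choices made for this illustration.

        #include <lapacke.h>
        #include <stdlib.h>
        #include <string.h>

        /* Compress B (m-by-n, row-major) so that B ~ X * Y^T with
         * ||B - X Y^T||_2 <= tol.  Returns the rank k (or -1 on failure)
         * and allocates X (m-by-k) and Y (n-by-k); the caller frees them. */
        int blr_compress(const double *B, int m, int n, double tol,
                         double **X, double **Y)
        {
            int mn = m < n ? m : n;
            double *A  = malloc((size_t)m * n * sizeof *A);  /* dgesvd destroys input */
            double *U  = malloc((size_t)m * mn * sizeof *U);
            double *VT = malloc((size_t)mn * n * sizeof *VT);
            double *S  = malloc((size_t)mn * sizeof *S);
            double *sb = malloc((size_t)mn * sizeof *sb);    /* superb workspace */
            memcpy(A, B, (size_t)m * n * sizeof *A);

            int info = LAPACKE_dgesvd(LAPACK_ROW_MAJOR, 'S', 'S', m, n,
                                      A, n, S, U, mn, VT, n, sb);
            int k = 0;
            if (info == 0)
                while (k < mn && S[k] > tol) k++;  /* numerical rank at tol */

            if (info == 0) {
                *X = malloc((size_t)m * k * sizeof **X);  /* X = U_k * diag(S_k) */
                *Y = malloc((size_t)n * k * sizeof **Y);  /* Y = V_k            */
                for (int i = 0; i < m; i++)
                    for (int j = 0; j < k; j++)
                        (*X)[i * k + j] = U[i * mn + j] * S[j];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < k; j++)
                        (*Y)[i * k + j] = VT[j * n + i];
            }
            free(A); free(U); free(VT); free(S); free(sb);
            return info == 0 ? k : -1;
        }

    The block is kept in low-rank form only when k(m + n) < mn, i.e., when the product actually saves memory.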

    Robust memory-aware mappings for parallel multifrontal factorizations

    We study the memory scalability of the parallel multifrontal factorization of sparse matrices. In particular, we are interested in controlling the active memory specific to the multifrontal factorization. We illustrate why commonly used mapping strategies (e.g., proportional mapping) cannot provide high memory efficiency, which means that they tend to let the memory usage of the factorization grow as the number of processes increases. We propose “memory-aware” algorithms that aim at maximizing the granularity of parallelism while respecting memory constraints. These algorithms provide accurate memory estimates prior to the factorization and can significantly enhance the robustness of a multifrontal code. We illustrate our approach with experiments performed on large matrices.
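
    The baseline can be sketched as follows (the tree layout and the subtree_work field are assumptions made for illustration): proportional mapping recursively splits the processes of a node among its children in proportion to the work below each child, with no regard to the memory each subtree will need; a memory-aware variant would additionally check a per-process memory bound before accepting such a split.

        #define MAX_CHILDREN 8

        struct node {
            double subtree_work;               /* total flops below this node */
            int    nchildren;
            struct node *child[MAX_CHILDREN];
            int    nprocs;                     /* processes mapped onto this node */
        };

        /* Classic proportional mapping: split nprocs among the children in
         * proportion to their subtree workloads, then recurse. */
        static void proportional_map(struct node *nd, int nprocs)
        {
            nd->nprocs = nprocs;
            if (nd->nchildren == 0) return;

            double total = 0.0;
            for (int c = 0; c < nd->nchildren; c++)
                total += nd->child[c]->subtree_work;

            for (int c = 0; c < nd->nchildren; c++) {
                int share = (int)((double)nprocs
                                  * nd->child[c]->subtree_work / total);
                if (share < 1) share = 1;      /* every child gets a process */
                proportional_map(nd->child[c], share);
            }
            /* Rounding can over- or under-assign; real schedulers rebalance
             * here, and a memory-aware variant rejects splits whose
             * per-process memory estimate exceeds the prescribed bound. */
        }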