    Nested dissection with balanced halo

    Nested dissection was introduced by A. George and remains a well-known, very popular heuristic for sparse matrix ordering, reducing both the fill-in and the operation count of the numerical factorization. For hybrid methods that mix direct and iterative solvers, obtaining a domain decomposition that balances both the size of the domain interiors and the size of the interfaces is a key point for load balancing and efficiency in a parallel context. For this purpose, we revisit the algorithm introduced by Lipton, Rose and Tarjan, which performs the recursion in a different manner.
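    A minimal Python sketch of the nested dissection recursion described above: find a vertex separator, recurse on the two halves, and number the separator last so its fill-in stays confined. The separator heuristic here is a deliberately naive placeholder, not the balanced-halo variant the paper proposes.

```python
# Nested dissection sketch. find_separator() is a hypothetical placeholder
# for a real separator heuristic (e.g., the balanced-halo variant above).

def find_separator(graph, vertices):
    """Naive separator: split the vertex list in half and take the
    boundary of the first half. A real implementation would use a
    graph-partitioning heuristic instead."""
    half = len(vertices) // 2
    part_a, part_b = set(vertices[:half]), set(vertices[half:])
    sep = {v for v in part_a if any(u in part_b for u in graph[v])}
    return sorted(part_a - sep), sorted(part_b), sorted(sep)

def nested_dissection(graph, vertices=None):
    """Return an elimination order: recurse on the two halves, then
    number the separator last."""
    if vertices is None:
        vertices = sorted(graph)
    if len(vertices) <= 2:            # base case: order directly
        return list(vertices)
    a, b, sep = find_separator(graph, vertices)
    return nested_dissection(graph, a) + nested_dissection(graph, b) + sep

# Example: a 1-D chain 0-1-2-3-4-5-6 as an adjacency-list dict.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
print(nested_dissection(chain))
```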

    Memory Optimization to Build a Schur Complement in an Hybrid Solver

    Solving the linear system Ax = b in parallel, where A is a large sparse matrix, is a very recurrent problem in numerical simulations. One of the most promising state-of-the-art algorithms is the hybrid method based on domain decomposition and the Schur complement. In this method, a direct solver is used as a subroutine on each subdomain matrix. This approach suffers from serious memory overhead. In this paper, we investigate new techniques to reduce memory consumption while the Schur complement is built by a direct solver. Our method reduces the memory peak by 10% to 30% on each process for typical test cases.
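    For context, the Schur complement in question is S = A_GG - A_GI A_II^{-1} A_IG, obtained by eliminating the interior unknowns of a subdomain. Below is a minimal SciPy sketch with illustrative index sets; it does not reproduce the paper's memory-reduction techniques inside the direct solver.

```python
# Forming the Schur complement S = A_GG - A_GI * inv(A_II) * A_IG with SciPy.
# Index sets (interior/interface) are illustrative.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def schur_complement(A, interior, interface):
    """Eliminate the interior unknowns and return the (dense) Schur
    complement on the interface unknowns."""
    A_II = A[interior, :][:, interior]
    A_IG = A[interior, :][:, interface]
    A_GI = A[interface, :][:, interior]
    A_GG = A[interface, :][:, interface]
    lu = spla.splu(A_II.tocsc())          # sparse LU of the interior block
    Y = lu.solve(A_IG.toarray())          # A_II^{-1} A_IG, dense multi-RHS solve
    return A_GG.toarray() - A_GI @ Y

# Tiny example: 1-D Laplacian, last two unknowns treated as the interface.
n = 6
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
S = schur_complement(A, interior=list(range(4)), interface=[4, 5])
print(S)
```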

    Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments

    Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.
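    The indexing-as-SpGEMM idea can be illustrated in a few lines: extracting A(I, J) amounts to multiplying by boolean selection matrices, A(I, J) = R A Q. Here is a serial SciPy sketch of that reduction; the paper's contribution is the 2-D distributed SpGEMM underneath, which this does not attempt to reproduce.

```python
# Sparse-matrix indexing expressed as two sparse matrix-matrix products.

import numpy as np
import scipy.sparse as sp

def selection_matrix(rows, n):
    """R[k, rows[k]] = 1: left-multiplying by R picks out those rows."""
    k = len(rows)
    return sp.csr_matrix((np.ones(k), (np.arange(k), rows)), shape=(k, n))

def spref(A, I, J):
    """A(I, J) = R * A * Q via SpGEMM."""
    R = selection_matrix(I, A.shape[0])
    Q = selection_matrix(J, A.shape[1]).T   # columns of A are picked by A * Q
    return R @ A @ Q

A = sp.random(8, 8, density=0.3, format="csr", random_state=0)
sub = spref(A, I=[1, 3, 5], J=[0, 2])
print(sub.toarray())
```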

    On the resilience of parallel sparse hybrid solvers

    As the computational power of high performance computing (HPC) systems continues to increase through huge numbers of CPU cores or specialized processing units, extreme-scale applications are increasingly prone to faults. Consequently, the HPC community has proposed many contributions to design resilient HPC applications. These contributions may be system-oriented, theoretical or numerical. In this study we consider an actual fully-featured parallel sparse hybrid (direct/iterative) linear solver, MaPHyS, and we propose numerical remedies to design a resilient version of the solver. The solver being hybrid, we focus on the iterative solution step, which is often the dominant step in practice. We furthermore assume that a separate mechanism ensures fault detection and that a system layer provides support for restoring the environment (processes, ...) to a running state. The present manuscript therefore focuses solely on strategies for recovering lost data once the fault has been detected and the system restored; detection and restoration are separate concerns beyond the scope of this study. The numerical remedies we propose are twofold. Whenever possible, we exploit the natural data redundancy between processes of the solver to perform exact recovery through careful copies across processes. Otherwise, data that has been lost and is no longer available on any process is recovered through a so-called interpolation-restart mechanism. This mechanism is derived from [1] by carefully taking into account the properties of the target hybrid solver. These numerical remedies have been implemented in the MaPHyS parallel solver so that we can assess their efficiency on a large number of processing units (up to 12,288 CPU cores) for solving large-scale real-life problems.
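    A minimal sketch of the interpolation-restart idea, under simplifying assumptions: the entries of the iterate lost by a failed process are re-interpolated by solving a local linear system on the lost indices, after which the Krylov iteration restarts from the repaired vector. This is a generic serial illustration, not the MaPHyS implementation.

```python
# Interpolation-restart recovery sketch: repair x[lost] by solving
# A_ll * x_l = b_l - A_lk * x_k with the surviving entries x_k, then restart.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def interpolate_lost_entries(A, b, x, lost):
    """Return a copy of x whose lost entries are re-interpolated."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A_ll = A[lost, :][:, lost]
    A_lk = A[lost, :][:, kept]
    rhs = b[lost] - A_lk @ x[kept]
    x = x.copy()
    x[lost] = spla.spsolve(A_ll.tocsc(), rhs)
    return x

n = 50
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x, _ = spla.cg(A, b, maxiter=20)            # partially converged iterate
x_faulty = x.copy()
x_faulty[10:20] = 0.0                       # simulate a lost block
x_repaired = interpolate_lost_entries(A, b, x_faulty, np.arange(10, 20))
x_final, _ = spla.cg(A, b, x0=x_repaired)   # restart from the repaired vector
print(np.linalg.norm(A @ x_final - b))
```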

    Hierarchical hybrid sparse linear solver for multicore platforms

    The solution of large sparse linear systems is a critical operation for many numerical simulations. To cope with the hierarchical design of modern supercomputers, hybrid solvers based on Domain Decomposition Methods (DDM) have been proposed. Among them, approaches consisting of solving the problem on the interior of the domains with a sparse direct method and the problem on their interface with a preconditioned iterative method applied to the related Schur complement have shown an attractive potential, as they can combine the robustness of direct methods and the low memory footprint of iterative methods. In this report, we consider an additive Schwarz preconditioner for the Schur complement, which represents a scalable candidate but whose numerical robustness may decrease when the number of domains becomes too large. We thus propose a two-level MPI/thread parallel approach to control the number of domains and hence the numerical behaviour. We illustrate our discussion with large-scale matrices arising from real-life applications, processed on both a modern cluster and a supercomputer. We show that the resulting method can process matrices such as tdr455k, for which we previously either ran out of memory on few nodes or failed to converge on a larger number of nodes. Matrices such as Nachos_4M that could not be correctly processed in the past can now be efficiently processed up to a very large number of CPU cores (24,576 cores). The corresponding code has been incorporated into the MaPHyS package.
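    A minimal sketch of an additive Schwarz preconditioner for the Schur complement, M^{-1} = sum_i R_i^T S_i^{-1} R_i, where R_i restricts the global interface to subdomain i's piece. Names and the dense local factorizations are illustrative; the actual solver works with distributed (and possibly approximated) local Schur complements.

```python
# Additive Schwarz preconditioner sketch: apply each local inverse on its
# piece of the interface and sum the contributions back into the global vector.

import numpy as np

class AdditiveSchwarz:
    def __init__(self, local_schurs, index_sets, n):
        """local_schurs[i] is S_i (dense); index_sets[i] its global indices."""
        self.inv = [np.linalg.inv(S) for S in local_schurs]  # factor once
        self.idx = index_sets
        self.n = n

    def apply(self, r):
        """z = M^{-1} r = sum_i R_i^T S_i^{-1} R_i r."""
        z = np.zeros(self.n)
        for Si_inv, I in zip(self.inv, self.idx):
            z[I] += Si_inv @ r[I]
        return z

# Toy usage: two subdomain pieces of a 4-unknown interface.
S1 = np.array([[2.0, -1.0], [-1.0, 2.0]])
S2 = np.array([[2.0, -1.0], [-1.0, 2.0]])
M = AdditiveSchwarz([S1, S2], [np.array([0, 1]), np.array([2, 3])], n=4)
print(M.apply(np.ones(4)))
```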

    Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware

    The past decade has seen the breakdown of two important trends in the computing industry: Moore’s law, the observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, which enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e. specialized hardware that delivers significantly better energy efficiency than general-purpose processors such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms. A practical solution that can close this gap is thus lucrative and is likely to have broad impact in both academic research and industry.

    This dissertation proposes such a solution: a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on the fly to tailor itself to the requirements of each application phase. The system is designed to deliver near-accelerator-level efficiency across a broad set of applications while retaining CPU-like programmability.

    The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer-product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. A proof-of-concept is then presented in the form of a prototyped 40 nm Complementary Metal-Oxide-Semiconductor (CMOS) chip that demonstrates energy-efficiency and performance-per-die-area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system.

    The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource-sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to distinct data-access and compute patterns, such as workloads with extensive data sharing and reuse, and workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, and image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU.

    The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware upon detecting a change in phase or in the nature of the data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision-tree-based predictive models, and incorporates reconfiguration cost-awareness into its policies. Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy efficiency.

    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies.
    http://deepblue.lib.umich.edu/bitstream/2027.42/169859/1/subh_1.pd
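    A minimal sketch of the outer-product SpMM formulation the accelerator is co-designed around: C = sum_k A[:, k] B[k, :], a multiply phase that forms rank-1 partial products followed by a merge phase that accumulates them. This SciPy version only illustrates the dataflow; the chip implements the two phases in hardware.

```python
# Outer-product SpMM sketch: multiply phase forms one rank-1 partial product
# per column/row pair; merge phase accumulates the partial-product matrices.

import scipy.sparse as sp

def outer_product_spmm(A, B):
    A = A.tocsc()   # fast column access for A
    B = B.tocsr()   # fast row access for B
    partials = []
    for k in range(A.shape[1]):
        col = A.getcol(k)                  # sparse m x 1
        row = B.getrow(k)                  # sparse 1 x n
        if col.nnz and row.nnz:
            partials.append(col @ row)     # multiply phase: rank-1 product
    # merge phase: accumulate all partial products
    return sum(partials, sp.csr_matrix((A.shape[0], B.shape[1])))

A = sp.random(6, 5, density=0.4, format="csc", random_state=1)
B = sp.random(5, 7, density=0.4, format="csr", random_state=2)
C = outer_product_spmm(A, B)
print(abs(C - A @ B).max())   # ~0: matches the direct SpGEMM result
```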