15 research outputs found

    FEMPAR: Scaling Multi-Level Domain Decomposition up to the full JUQUEEN supercomputer

    Type of production: technical report (JUQUEEN Extreme Scaling Workshop, Juelich Supercomputing Center, Germany, 2015). Format: contributed chapter. In conjunction with this year's JUQUEEN Porting and Tuning Workshop, which is part of the PRACE Advanced Training Centres curriculum, JSC continued its series of BlueGene Extreme Scaling Workshops. Seven application teams were invited to stay for two days and work on the scalability of their codes, with dedicated access to the entire JUQUEEN system for a period of 30 hours. Most of the teams' codes had thematic overlap with JSC Simulation Laboratories or were part of an ongoing collaboration with one of the SimLabs. The code teams came from the fields of climate science (ICON from DKRZ, and MPAS-A from KIT and NCAR), engineering (FEMPAR from UPC, and ex_nl/FE^2 from Uni Cologne and TU Freiberg), fluid dynamics (psOpen and SHOCK, both from RWTH Aachen), and neuroscience (CoreNeuron from the EPFL Blue Brain Project), and were supported by JSC SimLabs and Cross-sectional teams, with IBM and JUQUEEN technical support. Within the first 24 hours of dedicated access to the entire 28 racks, all seven teams had adapted their codes and datasets to exploit the massive parallelism and restricted node memory for successful executions using all 458,752 cores. Most of them also demonstrated excellent strong or weak scalability, qualifying all but one for the High-Q Club. A total of 370 'large' jobs were executed using 12 of the 15 million core-hours of compute time allocated for the workshop. Detailed results for each code, provided by the application teams themselves, are introduced by an analysis comparing them to the other 16 High-Q Club codes. Preprint

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, giving them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. Comment: Heterogeneity in Computing Workshop (2014)

    Multilevel balancing domain decomposition at extreme scales

    In this paper we present a fully distributed, communicator-aware, recursive, and interlevel-overlapped message-passing implementation of the multilevel balancing domain decomposition by constraints (MLBDDC) preconditioner. The implementation relies heavily on subcommunicators in order to achieve the desired effect of coarse-grain overlapping of computation and communication, and communication and communication among levels in the hierarchy (namely, interlevel overlapping). Essentially, the main communicator is split into as many nonoverlapping subsets of message-passing interface (MPI) tasks (i.e., MPI subcommunicators) as levels in the hierarchy. Provided that specialized resources (cores and memory) are devoted to each level, a careful rescheduling and mapping of all the computations and communications in the algorithm lets a high degree of overlapping be exploited among levels. All subroutines and associated data structures are expressed recursively, and therefore MLBDDC preconditioners with an arbitrary number of levels can be built while re-using significant and recurrent parts of the codes. This approach leads to excellent weak scalability results as soon as level-1 tasks can fully overlap coarser-level duties. We provide a model to indicate how to choose the number of levels and coarsening ratios between consecutive levels and determine qualitatively the scalability limits for a given choice. We have carried out a comprehensive weak scalability analysis of the proposed implementation for the three-dimensional Laplacian and linear elasticity problems on structured and unstructured meshes. Excellent weak scalability results have been obtained up to 458,752 IBM BG/Q cores and 1.8 million MPI tasks, being the first time that exact domain decomposition preconditioners (only based on sparse direct solvers) reach these scales. (An erratum is attached.)
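
The splitting of one communicator into one nonoverlapping group of tasks per level can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `level_colors`, the uniform coarsening ratios, and the rank layout (fine-level ranks first) are assumptions made for illustration. In a real MPI code, the resulting "color" of each rank would typically be passed to a call such as `MPI_Comm_split`.

```python
def level_colors(n_fine, ratios):
    """Assign every rank of a flat communicator to exactly one level.

    n_fine : number of level-1 (finest) tasks            (hypothetical input)
    ratios : coarsening ratio between consecutive levels (hypothetical input)

    Returns (sizes, colors): sizes[k] is the number of tasks devoted to
    level k+1, and colors[rank] is the level that rank belongs to.
    """
    sizes = [n_fine]
    for r in ratios:
        if sizes[-1] % r != 0:
            raise ValueError("ratio must divide the size of the previous level")
        sizes.append(sizes[-1] // r)

    # Ranks are laid out level by level: all level-1 ranks, then level 2, ...
    colors = []
    for level, size in enumerate(sizes, start=1):
        colors.extend([level] * size)
    return sizes, colors


# Three levels: 64 fine tasks, 8 tasks on level 2, 1 task on level 3.
sizes, colors = level_colors(n_fine=64, ratios=[8, 8])
total_ranks = len(colors)
```

Because every rank receives exactly one color, the groups are nonoverlapping, so each level gets its own dedicated tasks and can proceed concurrently with the others.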

    Parallel and scalable heat methods for geodesic distance computation

    In this paper, we propose a parallel and scalable approach for geodesic distance computation on triangle meshes. Our key observation is that the recovery of geodesic distance with the heat method from [Crane et al. 2013] can be reformulated as optimization of its gradients subject to integrability, which can be solved using an efficient first-order method that requires no linear system solving and converges quickly. Afterward, the geodesic distance is efficiently recovered by parallel integration of the optimized gradients in breadth-first order. Moreover, we employ a similar breadth-first strategy to derive a parallel Gauss-Seidel solver for the diffusion step in the heat method. To further lower the memory consumption from gradient optimization on faces, we also propose a formulation that optimizes the projected gradients on edges, which reduces the memory footprint by about 50%. Our approach is trivially parallelizable, with a low memory footprint that grows linearly with respect to the model size. This makes it particularly suitable for handling large models. Experimental results show that it can efficiently compute geodesic distance on meshes with more than 200 million vertices on a desktop PC with 128GB RAM, outperforming the original heat method and other state-of-the-art geodesic distance solvers.
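
The idea of recovering distances by integrating gradients in breadth-first order can be sketched on a plain graph. This is a simplified illustration, not the paper's algorithm: it assumes the gradients have already been reduced to integrable per-edge differences `edge_diff[(u, v)] = d(v) - d(u)`, and the names `integrate_bfs`, `adjacency`, and `edge_diff` are hypothetical.

```python
def integrate_bfs(adjacency, edge_diff, source):
    """Recover per-vertex values from per-edge differences, layer by layer.

    adjacency : dict mapping vertex -> list of neighbouring vertices
    edge_diff : dict mapping (u, v) -> d(v) - d(u), assumed integrable
    source    : vertex whose value is fixed to 0
    """
    dist = {source: 0.0}
    frontier = [source]
    while frontier:
        next_frontier = []
        # Every vertex in the current layer only reads values that are
        # already final, so a layer could be processed in parallel; if the
        # differences are integrable, concurrent writers agree on the value.
        for u in frontier:
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + edge_diff[(u, v)]
                    next_frontier.append(v)
        frontier = next_frontier
    return dist


# A 4-cycle 0-1-3-2-0 whose true values are d = [0, 1, 1, 2].
true_d = [0.0, 1.0, 1.0, 2.0]
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
edge_diff = {(u, v): true_d[v] - true_d[u]
             for u in adjacency for v in adjacency[u]}
dist = integrate_bfs(adjacency, edge_diff, source=0)
```

No linear system is solved here; each value is obtained by a single addition from a vertex in the previous layer.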

    Reordering Strategy for Blocking Optimization in Sparse Linear Solvers

    Solving sparse linear systems is a problem that arises in many scientific applications, and sparse direct solvers are a time-consuming and key kernel for those applications and for more advanced solvers such as hybrid direct-iterative solvers. For this reason, optimizing their performance on modern architectures is critical. The preprocessing steps of sparse direct solvers (ordering and block-symbolic factorization) are two major steps that lead to a reduced amount of computation and memory and to a better task granularity to reach a good level of performance when using BLAS kernels. With the advent of GPUs, the granularity of the block computation has become more important than ever. In this paper, we present a reordering strategy that increases this block granularity. This strategy relies on block-symbolic factorization to refine the ordering produced by tools such as Metis or Scotch, but it does not impact the number of operations required to solve the problem. We integrate this algorithm in the PaStiX solver and show an important reduction of the number of off-diagonal blocks on a large spectrum of matrices. This improvement leads to an increase in efficiency of up to 20% on GPUs.
    1. Introduction. Many scientific applications, such as electromagnetism, astrophysics, and computational fluid dynamics, use numerical models that require solving linear systems of the form $Ax = b$. In those problems, the matrix $A$ can be considered as either dense (almost no zero entries) or sparse (mostly zero entries). Due to multiple structural and numerical differences that appear in those problems, many different solutions exist to solve them. In this paper, we focus on problems leading to sparse systems with a symmetric pattern and, more specifically, on direct methods which factorize the matrix $A$ in $LL^t$, $LDL^t$, or $LU$, with $L$, $D$, and $U$ being, respectively, unit lower triangular, diagonal, and upper triangular, according to the numerical properties of the problem. Those sparse matrices appear mostly when discretizing partial differential equations (PDEs) on two-dimensional (2D) and three-dimensional (3D) finite element or finite volume meshes. The main issue with such factorizations is the fill-in (zero entries becoming nonzero) that appears in the factorized form of $A$ during the execution of the algorithm. If not correctly considered, the fill-in can transform the sparse matrix into a dense one which might not fit in memory. In this context, sparse direct solvers rely on two important preprocessing steps to reduce this fill-in and control where it appears. The first one finds a suitable ordering of the unknowns that aims at minimizing the fill-in to limit the memory overhead and floating point operations (Flops) required to complete the factorization. The problem is then transformed into $(PAP^t)(Px) = Pb$
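
The effect of the ordering on fill-in can be seen on a small example. The sketch below is only an illustration of the general principle, not the reordering strategy of the paper: it performs a symbolic elimination on the adjacency graph of a symmetric-pattern matrix and counts the new entries created. The function name `count_fill_in` and the "arrow" test matrix are assumptions chosen for the example.

```python
def count_fill_in(n, edges, order):
    """Count fill-in entries (lower triangle) for a given elimination order.

    n     : number of unknowns
    edges : off-diagonal nonzeros (i, j) of a symmetric-pattern matrix
    order : permutation of range(n) giving the elimination order
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    eliminated = set()
    fill = 0
    for v in order:
        # Eliminating v couples all of its not-yet-eliminated neighbours.
        nbrs = [w for w in adj[v] if w not in eliminated]
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1      # a zero entry becomes nonzero
        eliminated.add(v)
    return fill


# "Arrow" pattern: unknown 0 is coupled to every other unknown.
n = 5
edges = [(0, k) for k in range(1, n)]
fill_hub_first = count_fill_in(n, edges, order=[0, 1, 2, 3, 4])
fill_hub_last = count_fill_in(n, edges, order=[1, 2, 3, 4, 0])
```

Eliminating the densely coupled unknown first makes the remaining block completely dense, whereas eliminating it last creates no fill-in at all; choosing the permutation $P$ amounts to choosing such an order.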