4 research outputs found

    Asynchronous versions of Jacobi, multigrid, and Chebyshev solvers

    Get PDF
    Iterative methods are commonly used for solving large, sparse systems of linear equations on parallel computers. Implementations of parallel iterative solvers contain kernels (e.g., parallel sparse matrix-vector products) in which parallel processes alternate between phases of computation and communication. Standard software packages use synchronous implementations where there are one or more synchronization points per iteration. These synchronization points occur during communication phases where each process sends data to other processes and idles until all data needed for the next iteration is received. Synchronization points scale poorly on massively parallel machines and may become the primary bottleneck for future exascale computers. This calls for research and development of asynchronous iterative methods, which is the subject of this dissertation. In asynchronous iterative methods there are no synchronization points. This means that, after a phase of computation, processes immediately proceed to the next phase of computation using whatever data is currently available. Since the late 1960s, research on asynchronous methods has primarily considered basic fixed-point methods, e.g., Jacobi, where proving asymptotic convergence bounds has been the focus. However, the practical behavior of asynchronous methods is not well understood, and asynchronous versions of certain fast-converging solvers have not been developed. This dissertation focuses on studying the practical behavior of asynchronous Jacobi, developing new communication-avoiding asynchronous iterative solvers, and introducing the first asynchronous versions of multigrid and Chebyshev. To better understand the practical behavior of asynchronous Jacobi, we examine a model of asynchronous Jacobi where communication delays are neglected. We call this model simplified asynchronous Jacobi. Simplified asynchronous Jacobi can be used to model asynchronous Jacobi implemented in shared memory or distributed memory with fast communication networks. We analyze simplified asynchronous Jacobi for linear systems where the coefficient matrix is symmetric positive-definite and compare our analysis to experimental results from shared and distributed memory implementations. We present three important results for asynchronous Jacobi: it can converge when synchronous Jacobi does not, it can reduce the residual norm when some processes are delayed, and its convergence rate can increase with increasing parallelism. We develop new asynchronous communication-avoiding methods using the idea of the sequential Southwell method. In the sequential Southwell method, which converges faster than Gauss-Seidel, the component of the residual with the largest residual in absolute value is relaxed during each iteration. We use the idea of choosing large residual values to create communication-avoiding parallel methods, where residual values of communication neighbors are compared rather than computing a global maximum. We present three methods: the Parallel Southwell, Distributed Southwell, and Stochastic Parallel Southwell methods. All our methods converge faster than Jacobi and use less communication. We introduce the first asynchronous multigrid methods. We use the idea of additive multigrid where smoothing on all grids is carried out concurrently. We present models of asynchronous additive multigrid and use these models to study the convergence properties of asynchronous multigrid. We also introduce algorithms for implementing asynchronous multigrid in shared and distributed memory. Our experimental results show that asynchronous multigrid can exhibit grid-size independent convergence and can be faster than classical multigrid in terms of wall-clock time. Lastly, we present the first asynchronous Chebyshev methods. We present models of Jacobi-preconditioned asynchronous Chebyshev. We use a little-known version of the BPX multigrid preconditioner where BPX is written as Jacobi on an extended system, which makes BPX convenient for asynchronous execution within Chebsyhev. Our experimental results show that asynchronous Chebyshev is faster than its synchronous counterpart in terms of both wall-clock time and number of iterations.Ph.D

    Automatische Codegenerierung fĂŒr Massiv Parallele Applikationen in der Numerischen Strömungsmechanik

    Get PDF
    Solving partial differential equations (PDEs) is a fundamental challenge in many application domains in industry and academia alike. With increasingly large problems, efficient and highly scalable implementations become more and more crucial. Today, facing this challenge is more difficult than ever due to the increasingly heterogeneous hardware landscape. One promising approach is developing domain‐specific languages (DSLs) for a set of applications. Using code generation techniques then allows targeting a range of hardware platforms while concurrently applying domain‐specific optimizations in an automated fashion. The present work aims to further the state of the art in this field. As domain, we choose PDE solvers and, in particular, those from the group of geometric multigrid methods. To avoid having a focus too broad, we restrict ourselves to methods working on structured and patch‐structured grids. We face the challenge of handling a domain as complex as ours, while providing different abstractions for diverse user groups, by splitting our external DSL ExaSlang into multiple layers, each specifying different aspects of the final application. Layer 1 is designed to resemble LaTeX and allows inputting continuous equations and functions. Their discretization is expressed on layer 2. It is complemented by algorithmic components which can be implemented in a Matlab‐like syntax on layer 3. All information provided to this point is summarized on layer 4, enriched with particulars about data structures and the employed parallelization. Additionally, we support automated progression between the different layers. All ExaSlang input is processed by our jointly developed Scala code generation framework to ultimately emit C++ code. We particularly focus on how to generate applications parallelized with, e.g., MPI and OpenMP that are able to run on workstations and large‐scale cluster alike. We showcase the applicability of our approach by implementing simple test problems, like Poisson’s equation, as well as relevant applications from the field of computational fluid dynamics (CFD). In particular, we implement scalable solvers for the Stokes, Navier‐Stokes and shallow water equations (SWE) discretized using finite differences (FD) and finite volumes (FV). For the case of Navier‐Stokes, we also extend our implementation towards non‐uniform grids, thereby enabling static mesh refinement, and advanced effects such as the simulated fluid being non‐Newtonian and non‐isothermal
    corecore