    A quasi‐cache‐aware model for optimal domain partitioning in parallel geometric multigrid

    Stencil computations form the heart of numerical simulations to solve Partial Differential Equations using Finite Difference, Finite Element, and Finite Volume methods. Geometric Multigrid is an optimal O(N), hierarchical tool employing stencil computations in its chief constituents, namely, smoothing, restriction, and interpolation. When Multigrid is parallelized over distributed‐shared memory architectures, traditionally, the domain partitioning creates cubic partitions of the mesh to minimize overall communication. Thus, the orthodox approach considers only load‐balancing and communication minimization for completely determining the domain partitioning. In this article, we show that these two factors are not sufficient to obtain optimal partitions for Parallel Geometric Multigrid. To this end, we develop and validate a high level analytical model to show that “close to 2‐D” partitions for Geometric Multigrid can give higher performance than the partitions returned by the MPI_Dims_create() function, which minimizes the communication volume by default. We quantify sub‐domain level cache‐misses in Parallel Geometric Multigrid and obtain families of optimal domain partitions. We conclude that the sub‐domain level cache‐misses for the application‐specific stencil computational kernel and communicated planes should be taken into account in addition to communication minimization/load‐balance to obtain optimal partitions for Parallel Geometric Multigrid.
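
The communication-only view that the abstract argues against can be sketched in a few lines. The snippet below is an illustrative assumption, not the authors' model: it enumerates every 3-D process grid for a given core count and counts the cells that the worst-placed sub-domain exchanges per halo update, assuming a non-periodic n x n x n mesh, a halo one cell wide, and process counts that divide n. The names `process_grids` and `halo_cells` are hypothetical. The MPI standard specifies that MPI_Dims_create balances the dimensions as evenly as possible, which corresponds to the near-cubic grid this metric prefers.

```python
def process_grids(p):
    """Return every ordered triple (px, py, pz) with px * py * pz == p."""
    grids = []
    for px in range(1, p + 1):
        if p % px:
            continue
        rest = p // px
        for py in range(1, rest + 1):
            if rest % py == 0:
                grids.append((px, py, rest // py))
    return grids


def halo_cells(n, dims):
    """Cells the worst-placed sub-domain exchanges per halo update.

    Assumes a non-periodic n^3 mesh, a width-1 halo, and that every
    entry of dims divides n.
    """
    sx, sy, sz = (n // d for d in dims)
    face_areas = (sy * sz, sx * sz, sx * sy)  # faces normal to x, y, z
    # Along each axis a sub-domain has at most two neighbours, and none
    # when the mesh is not cut along that axis (d == 1).
    return sum(min(2, d - 1) * area for d, area in zip(dims, face_areas))


if __name__ == "__main__":
    n, p = 512, 64
    ranked = sorted(process_grids(p), key=lambda d: halo_cells(n, d))
    for dims in ranked[:5]:
        print(dims, halo_cells(n, dims))
```

For n = 512 and p = 64 this metric gives 98304 cells for the 4 x 4 x 4 grid and 131072 cells for the 1 x 8 x 8 grid, so a "close to 2-D" partition looks worse when only communication volume is counted. The abstract's point is that this count alone does not determine performance.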

    Efficient Domain Partitioning for Stencil-based Parallel Operators

    Partial Differential Equations (PDEs) are used ubiquitously in modelling natural phenomena. It is generally not possible to obtain an analytical solution and hence they are commonly discretized using schemes such as the Finite Difference Method (FDM) and the Finite Element Method (FEM), converting the continuous PDE to a discrete system of sparse algebraic equations. The solution of this system can be approximated using iterative methods, which are better suited to many sparse systems than direct methods. In this thesis we use the FDM to discretize linear, second order, Elliptic PDEs and consider parallel implementations of standard iterative solvers. The dominant paradigm in this field is distributed memory parallelism which requires the FDM grid to be partitioned across the available computational cores. The orthodox approach to domain partitioning aims to minimize only the communication volume and achieve perfect load-balance on each core. In this work, we re-examine and challenge this traditional method of domain partitioning and show that for well load-balanced problems, minimizing only the communication volume is insufficient for obtaining optimal domain partitions. To this effect we create a high-level, quasi-cache-aware mathematical model that quantifies cache-misses at the sub-domain level and minimizes them to obtain families of high performing domain decompositions. To our knowledge this is the first work that optimizes domain partitioning by analyzing cache misses, establishing a relationship between cache-misses and domain partitioning. To place our model in its true context, we identify and qualitatively examine multiple other factors such as the Least Recently Used policy, Cache Line Utilization and Vectorization, that influence the choice of optimal sub-domain dimensions. 
Since the convergence rate of point iterative methods, such as Jacobi, for uniform meshes is not acceptable at a high mesh resolution, we extend the model to Parallel Geometric Multigrid (GMG). GMG is a multilevel, iterative, optimal algorithm for numerically solving Elliptic PDEs. Adaptive Mesh Refinement (AMR) is another multilevel technique that allows local refinement of a global mesh based on parameters such as error estimates or geometric importance. We study a massively parallel, multiphysics, multi-resolution AMR framework called BoxLib, and implement and discuss our model on single-level and adaptively refined meshes, respectively. We conclude that “close to 2-D” partitions are optimal for stencil-based codes on structured 3-D domains and that it is necessary to optimize for both minimizing cache-misses and communication. We advise that in light of the evolving hardware-software ecosystem, there is an imperative need to re-examine conventional domain partitioning strategies.
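
The remark about Jacobi convergence at high resolution can be illustrated with a small experiment. This is a minimal sketch under simplifying assumptions that are not taken from the thesis: a 1-D model problem instead of a 3-D one, a zero right-hand side with zero Dirichlet boundaries so that the iterate itself is the error, and a hypothetical helper named `jacobi_sweeps`.

```python
def jacobi_sweeps(n, tol=1e-6, max_sweeps=100_000):
    """Count Jacobi sweeps needed to damp an initial error below tol.

    Model problem: -u'' = 0 on n interior points with zero Dirichlet
    boundaries, so the exact solution is zero and u itself is the error.
    """
    u = [1.0] * n
    for sweep in range(1, max_sweeps + 1):
        # Pad with the boundary values so every point sees two neighbours.
        padded = [0.0] + u + [0.0]
        u = [0.5 * (padded[i] + padded[i + 2]) for i in range(n)]
        if max(abs(v) for v in u) < tol:
            return sweep
    return max_sweeps


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, jacobi_sweeps(n))
```

The sweep count grows roughly with the square of the number of mesh points per direction, which is the behaviour that motivates moving to a multigrid method.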

    A Cache-Aware Approach to Domain Decomposition for Stencil-Based Codes

    Partial Differential Equations (PDEs) lie at the heart of numerous scientific simulations depicting physical phenomena. The parallelization of such simulations introduces additional performance penalties in the form of local and global synchronization among cooperating processes. Domain decomposition partitions the largest shareable data structures into sub-domains and attempts to achieve perfect load balance and minimal communication. Up to now, research efforts to optimize spatial and temporal cache reuse for stencil-based PDE discretizations (e.g. finite difference and finite element) have considered sub-domain operations after the domain decomposition has been determined. We derive a cache-oblivious heuristic that minimizes cache misses at the sub-domain level through a quasi-cache-directed analysis to predict families of high performance domain decompositions in structured 3-D grids. To the best of our knowledge this is the first work to optimize domain decompositions by analyzing cache misses, thus connecting single core parameters (i.e. cache-misses) to true multicore parameters (i.e. domain decomposition). We analyze the trade-offs in decreasing cache-misses through such decompositions and increasing the dynamic bandwidth-per-core. The limitation of our work is that currently, it is applicable only to structured 3-D grids with cuts parallel to the Cartesian Axes. We emphasize and conclude that there is an imperative need to re-think domain decompositions in this constantly evolving multicore era.
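
That sub-domain shape alone can change the number of cache misses is easy to see in a toy simulation. The sketch below is not the authors' heuristic. It replays the reads of one 7-point stencil sweep through a small, fully associative LRU cache and counts misses. The function name `stencil_misses`, the x-fastest storage order, the cache size, and the line length are all assumptions chosen for illustration, and writes to the output array are ignored.

```python
from collections import OrderedDict

# 7-point stencil: the centre and its six face neighbours.
OFFSETS = ((0, 0, 0), (-1, 0, 0), (1, 0, 0),
           (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1))


def stencil_misses(sx, sy, sz, cache_lines, line_elems=8):
    """Misses of a fully associative LRU cache during one stencil sweep.

    The sx * sy * sz array is stored with x as the unit-stride index.
    Only reads of the input array are modelled.
    """
    cache = OrderedDict()  # keys are cache-line ids, oldest first
    misses = 0
    for k in range(1, sz - 1):
        for j in range(1, sy - 1):
            for i in range(1, sx - 1):
                for di, dj, dk in OFFSETS:
                    addr = ((k + dk) * sy + (j + dj)) * sx + (i + di)
                    line = addr // line_elems
                    if line in cache:
                        cache.move_to_end(line)  # mark as most recent
                    else:
                        misses += 1
                        cache[line] = None
                        if len(cache) > cache_lines:
                            cache.popitem(last=False)  # evict the LRU line
    return misses


if __name__ == "__main__":
    # Two sub-domains with the same number of cells but different shapes.
    for shape in ((8, 8, 64), (64, 8, 8)):
        print(shape, stencil_misses(*shape, cache_lines=32))
```

With a 32-line cache the 8 x 8 x 64 shape keeps its neighbouring planes resident, while the 64 x 8 x 8 shape does not even keep the rows of one stencil row sweep resident, so the second shape incurs more misses at equal volume. Which shape wins in practice depends on the storage order, the sweep order, and the real cache hierarchy, so the numbers here carry no weight beyond the illustration.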

    Schnelle Löser für partielle Differentialgleichungen

    The workshop Schnelle Löser für partielle Differentialgleichungen, organised by Randolph E. Bank (La Jolla), Wolfgang Hackbusch (Leipzig), and Gabriel Wittum (Heidelberg), was held May 22nd - May 28th, 2005. The meeting was well attended, with 47 participants and broad geographic representation from 9 countries and 3 continents. The workshop brought together researchers with various backgrounds.

    New approaches for efficient on-the-fly FE operator assembly in a high-performance mantle convection framework


    Asynchronous Stabilisation and Assembly Techniques for Additive Multigrid

    Multigrid solvers are among the best solvers in the world, but once applied in the real world there are issues they must overcome. Many multigrid phases exhibit low concurrency. Mesh and matrix assembly are challenging to parallelise and introduce algorithmic latency. Dynamically adaptive codes exacerbate these issues. Multigrid codes require the computation of a cascade of matrices and dynamic adaptivity means these matrices are recomputed throughout the solve. Existing methods to compute the matrices are expensive and delay the solve. Non-trivial material parameters further increase the cost of accurate equation integration. We propose to assemble all matrix equations as stencils in a delayed element-wise fashion. Early multigrid iterations use cheap geometric approximations and more accurate updated stencil integrations are computed in parallel with the multigrid cycles. New stencil integrations are evaluated lazily and asynchronously fed to the solver once they become available. They do not delay multigrid iterations. We deploy stencil integrations as parallel tasks that are picked up by cores that would otherwise be idle. Coarse grid solves in multiplicative multigrid also exhibit limited concurrency. Small coarse mesh sizes correspond to small computational workload and require costly synchronisation steps. This acts as a bottleneck and delays solver iterations. Additive multigrid avoids this restriction, but becomes unstable for non-trivial material parameters as additive coarse grid levels tend to overcorrect. This leads to oscillations. We propose a new additive variant, adAFAC-x, with a stabilisation parameter that damps coarse grid corrections to remove oscillations. Per level, we solve an additional equation that produces an auxiliary correction. The auxiliary correction can be computed additively to the rest of the solve and uses ideas similar to smoothed aggregation multigrid to anticipate overcorrections.
Pipelining techniques allow adAFAC-x to be written using single-touch semantics on a dynamically adaptive mesh.

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    A Cache-Aware Approach to Adaptive Mesh Refinement in Parallel Stencil-based Solvers

    In prior research the authors have demonstrated that, for stencil-based numerical solvers for Partial Differential Equations (PDEs), the parallel performance can be significantly improved by selecting sub-domains that are not cubic in shape (Saxena et al., HPCS 2016, pp. 875-885). This is achieved by accounting for cache utilization in both the message passing and the computational kernel, where it is demonstrated that the optimal domain decompositions depend not only on the communication and load balance but also on the cache-misses, amongst other factors. In this work we demonstrate that those conclusions may also be extended to more advanced numerical discretizations, based upon Adaptive Mesh Refinement (AMR). In particular, we show that when basing our AMR strategy on the local refinement of patches of the mesh, the optimal patch shape is not typically cubic. We provide specific examples, with accompanying explanation, to show that communication minimizing strategies are not necessarily the best choice when applying AMR in parallel. All numerical tests undertaken in this work are based upon the open source BoxLib library.