41 research outputs found

    The Peano software---parallel, automaton-based, dynamically adaptive grid traversals

    We discuss the design decisions, design alternatives, and rationale behind the third generation of Peano, a framework for dynamically adaptive Cartesian meshes derived from spacetrees. Peano ties the mesh traversal to the mesh storage and supports only one element-wise traversal order resulting from space-filling curves. The user is not free to choose a traversal order herself. The traversal can exploit regular grid subregions and shared memory as well as distributed memory systems with almost no modifications to a serial application code. We formalize the software design by means of two interacting automata: one automaton for the multiscale grid traversal and one for the application-specific algorithmic steps. This yields a callback-based programming paradigm. We further sketch the supported application types and the two data storage schemes realized before we detail high-performance computing aspects and lessons learned. Special emphasis is put on observations regarding the programming idioms and algorithmic concepts used. This transforms our report from a “one way to implement things” code description into a generic discussion and summary of some alternatives, rationale, and design decisions to be made for any tree-based adaptive mesh refinement software.

    SFC-based Communication Metadata Encoding for Adaptive Mesh

    This volume of the series “Advances in Parallel Computing” contains the proceedings of the International Conference on Parallel Programming – ParCo 2013 – held from 10 to 13 September 2013 in Garching, Germany. The conference was hosted by the Technische Universität München (Department of Informatics) and the Leibniz Supercomputing Centre. The present paper studies two adaptive mesh refinement (AMR) codes whose grids rely on recursive subdivision in combination with space-filling curves (SFCs). A non-overlapping domain decomposition based upon these SFCs yields several well-known advantageous properties with respect to communication demands, balancing, and partition connectivity. However, the administration of the meta data, i.e. tracking which partitions exchange data in which cardinality, is nontrivial due to the SFC’s fractal meandering and the dynamic adaptivity. We introduce an analysed tree grammar for the meta data that restricts it without loss of information hierarchically along the subdivision tree and applies run-length encoding. Hence, its meta data memory footprint is very small, and it can be computed and maintained on-the-fly even for permanently changing grids. It facilitates a fork-join pattern for shared data parallelism. It also facilitates replicated data parallelism that tackles latency and bandwidth constraints, respectively, through communication in the background, and it reduces memory requirements by avoiding adjacency information stored per element. We demonstrate this by means of shared and distributed parallelized domain decompositions. This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89). It is partially based on work supported by Award No. UK-c0020, made by the King Abdullah University of Science and Technology (KAUST).

    Block Fusion on Dynamically Adaptive Spacetree Grids for Shallow Water Waves

    Spacetrees are a popular formalism to describe dynamically adaptive Cartesian grids. Even though they directly yield a mesh, it is often computationally reasonable to embed regular Cartesian blocks into their leaves. This promotes stencils working on homogeneous data chunks. The choice of a proper block size is sensitive. While large block sizes foster loop parallelism and vectorisation, they restrict the adaptivity's granularity and hence increase the memory footprint and lower the numerical accuracy per byte. In the present paper, we therefore use a multiscale spacetree-block coupling admitting blocks on all spacetree nodes. We propose to find sets of blocks on the finest scale throughout the simulation and to replace them by fused big blocks. Such a replacement strategy can pick up hardware characteristics, i.e. which block size yields the highest throughput, while the dynamic adaptivity of the fine grid mesh is not constrained; applications can work with fine granular blocks. We study the fusion with a state-of-the-art shallow water solver on an Intel Sandy Bridge and a Xeon Phi processor, where we anticipate their reaction to selected block optimisation and vectorisation.

    Parallel Multiscale Contact Dynamics for Rigid Non-spherical Bodies

    The simulation of large numbers of rigid bodies of non-analytical shapes or vastly varying sizes which collide with each other is computationally challenging. The fundamental problem is the identification of all contact points between all particles at every time step. In the Discrete Element Method (DEM), this is particularly difficult for particles of arbitrary geometry that exhibit sharp features (e.g. rock granulates). While most codes avoid non-spherical or non-analytical shapes due to the computational complexity, we introduce an iterative-based contact detection method for triangulated geometries. The new method is an improvement over a naive brute force approach which checks all possible geometric constellations of contact and thus exhibits a lot of execution branching. Our iterative approach has limited branching and high floating point operations per processed byte. It thus is suitable for modern Single Instruction Multiple Data (SIMD) CPU hardware. As only the naive brute force approach is robust and always yields a correct solution, we propose a hybrid solution that combines the best of the two worlds to produce fast and robust contacts. In terms of the DEM workflow, we furthermore propose a multilevel tree-based data structure strategy that holds all particles in the domain on multiple scales in grids. Grids reduce the total computational complexity of the simulation. The data structure is combined with the DEM phases to form a single touch tree-based traversal that identifies both contact points between particle pairs and introduces concurrency to the system during particle comparisons in one multiscale grid sweep. Finally, a reluctant adaptivity variant is introduced which enables us to realise an improved time stepping scheme with larger time steps than standard adaptivity while we still minimise the grid administration overhead. 
    Four different parallelisation strategies that exploit multicore architectures are discussed for the triad of methodological ingredients. Each parallelisation scheme exhibits unique behaviour depending on the grid and particle geometry at hand. The fusion of them into a task-based parallelisation workflow yields promising speedups. Our work shows that new computer architecture can push the boundary of DEM computability but this is only possible if the right data structures and algorithms are chosen.

    Complex additive geometric multilevel solvers for Helmholtz equations on spacetrees

    We introduce a family of implementations of low-order, additive, geometric multilevel solvers for systems of Helmholtz equations arising from Schrödinger equations. Both grid spacing and arithmetics may comprise complex numbers, and we thus can apply complex scaling to the indefinite Helmholtz operator. Our implementations are based on the notion of a spacetree and work exclusively with a finite number of precomputed local element matrices. They are globally matrix-free. Combining various relaxation factors with two grid transfer operators allows us to switch from additive multigrid over a hierarchical basis method into a Bramble-Pasciak-Xu (BPX)-type solver, with several multiscale smoothing variants within one code base. Pipelining allows us to realize full approximation storage (FAS) within the additive environment where, amortized, each grid vertex carrying degrees of freedom is read/written only once per iteration. The codes realize a single-touch policy. Among the features facilitated by matrix-free FAS is arbitrary dynamic mesh refinement (AMR) for all solver variants. AMR as an enabler for full multigrid (FMG) cycling, where the grid unfolds throughout the computation, allows us to reduce the cost per unknown. The present work primarily contributes toward software realization and design questions. Our experiments show that the consolidation of single-touch FAS, dynamic AMR, and vectorization-friendly, complex scaled, matrix-free FMG cycles delivers a mature implementation blueprint for solvers of Helmholtz equations in general. For this blueprint, we put particular emphasis on a strict implementation formalism as well as some implementation correctness proofs.

    Cluster-based communication and load balancing for simulations on dynamically adaptive grids

    Short paper. The present paper introduces a new communication and load-balancing scheme based on a clustering of the grid which we use for the efficient parallelization of simulations on dynamically adaptive grids. With a partitioning based on space-filling curves (SFCs), this yields several advantageous properties regarding the memory requirements and load balancing. However, for such an SFC-based partitioning, additional connectivity information has to be stored and updated for dynamically changing grids. In this work, we present our approach to keep this connectivity information run-length encoded (RLE) only for the interfaces shared between partitions. Using special properties of the underlying grid traversal and the communication scheme used, we update this connectivity information implicitly for dynamically changing grids and can represent the connectivity information as a sparse communication graph: graph nodes (partitions) represent bulks of connected grid cells and each graph edge (RLE connectivity information) a unique relation between adjacent partitions. This directly leads to an efficient shared-memory parallelization with graph nodes assigned to computing cores and an efficient en bloc data exchange via graph edges. We further refer to such a partitioning approach with RLE meta information as a cluster-based domain decomposition and to each partition as a cluster. With the sparse communication graph in mind, we then extend the connectivity information represented by the graph edges with MPI ranks, yielding an en bloc communication for distributed-memory systems and a hybrid parallelization. For data migration, the stack-based intra-cluster communication allows a very low memory footprint, and the RLE leads to efficient updates of connectivity information. Our benchmark is based on a shallow water simulation on a dynamically adaptive grid.
    We conducted performance studies for MPI-only and hybrid parallelizations, yielding an efficiency of over 90% on 256 cores. Furthermore, we demonstrate the applicability of cluster-based optimizations on distributed-memory systems. We would like to thank the Munich Centre of Advanced Computing for funding this project by providing computing time on the MAC Cluster. This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89).