1,544 research outputs found

    Cluster-based communication and load balancing for simulations on dynamically adaptive grids

    Get PDF
    short paperThe present paper introduces a new communication and load-balancing scheme based on a clustering of the grid which we use for the efficient parallelization of simulations on dynamically adaptive grids. With a partitioning based on space-filling curves (SFCs), this yields several advantageous properties regarding the memory requirements and load balancing. However, for such an SFC- based partitioning, additional connectivity information has to be stored and updated for dynamically changing grids. In this work, we present our approach to keep this connectivity information run-length encoded (RLE) only for the interfaces shared between partitions. Using special properties of the underlying grid traversal and used communication scheme, we update this connectivity information implicitly for dynamically changing grids and can represent the connectivity information as a sparse communication graph: graph nodes (partitions) represent bulks of connected grid cells and each graph edge (RLE connectivity information) a unique relation between adjacent partitions. This directly leads to an efficient shared-memory parallelization with graph nodes assigned to computing cores and an efficient en bloc data exchange via graph edges. We further refer to such a partitioning approach with RLE meta information as a cluster-based domain decomposition and to each partition as a cluster. With the sparse communication graph in mind, we then extend the connectivity information represented by the graph edges with MPI ranks, yielding an en bloc communication for distributed-memory systems and a hybrid parallelization. For data migration, the stack-based intra-cluster communication allows a very low memory footprint for data migration and the RLE leads to efficient updates of connectivity information. Our benchmark is based on a shallow water simulation on a dynamically adaptive grid. We conducted performance studies for MPI-only and hybrid parallelizations, yielding an efficiency of over 90% on 256 cores. Furthermore, we demonstrate the applicability of cluster-based optimizations on distributed-memory systems.We like to thank the Munich Centre of Advanced Computing for for funding this project by providing computing time on the MAC Cluster. This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre ”Invasive Computing” (SFB/TR 89)

    Invasive compute balancing for applications with shared and hybrid parallelization

    Get PDF
    This is the author manuscript. The final version is available from the publisher via the DOI in this record.Achieving high scalability with dynamically adaptive algorithms in high-performance computing (HPC) is a non-trivial task. The invasive paradigm using compute migration represents an efficient alternative to classical data migration approaches for such algorithms in HPC. We present a core-distribution scheduler which realizes the migration of computational power by distributing the cores depending on the requirements specified by one or more parallel program instances. We validate our approach with different benchmark suites for simulations with artificial workload as well as applications based on dynamically adaptive shallow water simulations, and investigate concurrently executed adaptivity parameter studies on realistic Tsunami simulations. The invasive approach results in significantly faster overall execution times and higher hardware utilization than alternative approaches. A dynamic resource management is therefore mandatory for a more efficient execution of scenarios similar to our simulations, e.g. several Tsunami simulations in urgent computing, to overcome strong scalability challenges in the area of HPC. The optimizations obtained by invasive migration of cores can be generalized to similar classes of algorithms with dynamic resource requirements.This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre ”Invasive Computing” (SFB/TR 89)

    Evaluation of an efficient etack-RLE clustering concept for dynamically adaptive grids

    Get PDF
    This is the author accepted manuscript. The final version is available from the Society for Industrial and Applied Mathematics via the DOI in this record.Abstract. One approach to tackle the challenge of efficient implementations for parallel PDE simulations on dynamically changing grids is the usage of space-filling curves (SFC). While SFC algorithms possess advantageous properties such as low memory requirements and close-to-optimal partitioning approaches with linear complexity, they require efficient communication strategies for keeping and utilizing the connectivity information, in particular for dynamically changing grids. Our approach is to use a sparse communication graph to store the connectivity information and to transfer data block-wise. This permits efficient generation of multiple partitions per memory context (denoted by clustering) which - in combination with a run-length encoding (RLE) - directly leads to elegant solutions for shared, distributed and hybrid parallelization and allows cluster-based optimizations. While previous work focused on specific aspects, we present in this paper an overall compact summary of the stack-RLE clustering approach completed by aspects on the vertex-based communication that ease up understanding the approach. The central contribution of this work is the proof of suitability of the stack-RLE clustering approach for an efficient realization of different, relevant building blocks of Scientific Computing methodology and real-life CSE applications: We show 95% strong scalability for small-scale scalability benchmarks on 512 cores and weak scalability of over 90% on 8192 cores for finite-volume solvers and changing grid structure in every time step; optimizations of simulation data backends by writer tasks; comparisons of analytical benchmarks to analyze the adaptivity criteria; and a Tsunami simulation as a representative real-world showcase of a wave propagation for our approach which reduces the overall workload by 95% for parallel fully-adaptive mesh refinement and, based on a comparison with SFC-ordered regular grid cells, reduces the computation time by a factor of 7.6 with improved results and a factor of 62.2 with results of similar accuracy of buoy station dataThis work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89)

    A Parallel Mesh-Adaptive Framework for Hyperbolic Conservation Laws

    Full text link
    We report on the development of a computational framework for the parallel, mesh-adaptive solution of systems of hyperbolic conservation laws like the time-dependent Euler equations in compressible gas dynamics or Magneto-Hydrodynamics (MHD) and similar models in plasma physics. Local mesh refinement is realized by the recursive bisection of grid blocks along each spatial dimension, implemented numerical schemes include standard finite-differences as well as shock-capturing central schemes, both in connection with Runge-Kutta type integrators. Parallel execution is achieved through a configurable hybrid of POSIX-multi-threading and MPI-distribution with dynamic load balancing. One- two- and three-dimensional test computations for the Euler equations have been carried out and show good parallel scaling behavior. The Racoon framework is currently used to study the formation of singularities in plasmas and fluids.Comment: late submissio

    Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code

    Get PDF
    Today, large scale parallel simulations are fundamental tools to handle complex problems. The number of processors in current computation platforms has been recently increased and therefore it is necessary to optimize the application performance and to enhance the scalability of massively-parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware, like FPGAs, to accelerate the most time consuming functions are considered as a strong alternative to boost the performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: Optimization, parallelization and hardware acceleration. At first, a profiling analysis of the most time-consuming processes of the Reynolds Averaged Navier Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, a study of the code scalability with new partitioning algorithms are tested to show the most suitable partitioning algorithms for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented

    A Parallel Adaptive P3M code with Hierarchical Particle Reordering

    Full text link
    We discuss the design and implementation of HYDRA_OMP a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important optimization being hierarchical reordering of particles within chaining cells, which greatly improves data locality thereby removing the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future.Comment: 34 pages, 12 figures, accepted for publication in Computer Physics Communication

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Full text link
    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.Comment: 18 pages, 4 figures, accepted for publication in Scientific Programmin

    SFC-based Communication Metadata Encoding for Adaptive Mesh

    Get PDF
    This volume of the series “Advances in Parallel Computing” contains the proceedings of the International Conference on Parallel Programming – ParCo 2013 – held from 10 to 13 September 2013 in Garching, Germany. The conference was hosted by the Technische UniversitĂ€t MĂŒnchen (Department of Informatics) and the Leibniz Supercomputing Centre.The present paper studies two adaptive mesh refinement (AMR) codes whose grids rely on recursive subdivison in combination with space-filling curves (SFCs). A non-overlapping domain decomposition based upon these SFCs yields several well-known advantageous properties with respect to communication demands, balancing, and partition connectivity. However, the administration of the meta data, i.e. to track which partitions exchange data in which cardinality, is nontrivial due to the SFC’s fractal meandering and the dynamic adaptivity. We introduce an analysed tree grammar for the meta data that restricts it without loss of information hierarchically along the subdivision tree and applies run length encoding. Hence, its meta data memory footprint is very small, and it can be computed and maintained on-the-fly even for permanently changing grids. It facilitates a forkjoin pattern for shared data parallelism. And it facilitates replicated data parallelism tackling latency and bandwidth constraints respectively due to communication in the background and reduces memory requirements by avoiding adjacency information stored per element. We demonstrate this at hands of shared and distributed parallelized domain decompositions.This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing (SFB/TR 89). It is partially based on work supported by Award No. UK-c0020, made by the King Abdullah University of Science and Technology (KAUST)

    Hardware-aware block size tailoring on adaptive spacetree grids for shallow water waves.

    Get PDF
    Spacetrees are a popular formalism to describe dynamically adaptive Cartesian grids. Though they directly yield an adaptive spatial discretisation, i.e. a mesh, it is often more efficient to augment them by regular Cartesian blocks embedded into the spacetree leaves. This facilitates stencil kernels working efficiently on homogeneous data chunks. The choice of a proper block size, however, is delicate. While large block sizes foster simple loop parallelism, vectorisation, and lead to branch-free compute kernels, they bring along disadvantages. Large blocks restrict the granularity of adaptivity and hence increase the memory footprint and lower the numerical-accuracy-per-byte efficiency. Large block sizes also reduce the block-level concurrency that can be used for dynamic load balancing. In the present paper, we therefore propose a spacetree-block coupling that can dynamically tailor the block size to the compute characteristics. For that purpose, we allow different block sizes per spacetree node. Groups of blocks of the same size are identied automatically throughout the simulation iterations, and a predictor function triggers the replacement of these blocks by one huge, regularly rened block. This predictor can pick up hardware characteristics while the dynamic adaptivity of the fine grid mesh is not constrained. We study such characteristics with a state-of-the-art shallow water solver and examine proper block size choices on AMD Bulldozer and Intel Sandy Bridge processors

    Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework

    Full text link
    We describe a set of lower-level abstractions to improve performance on modern large scale heterogeneous systems. These provide portable access to system- and hardware-dependent features, automatically apply dynamic optimizations at run time, and target stencil-based codes used in finite differencing, finite volume, or block-structured adaptive mesh refinement codes. These abstractions include a novel data structure to manage refinement information for block-structured adaptive mesh refinement, an iterator mechanism to efficiently traverse multi-dimensional arrays in stencil-based codes, and a portable API and implementation for explicit SIMD vectorization. These abstractions can either be employed manually, or be targeted by automated code generation, or be used via support libraries by compilers during code generation. The implementations described below are available in the Cactus framework, and are used e.g. in the Einstein Toolkit for relativistic astrophysics simulations
    • 

    corecore