17 research outputs found

    Communication-Avoiding Algorithms for a High-Performance Hyperbolic PDE Engine

    Get PDF
    The study of waves has always been an important subject of research. Earthquakes, for example, have a direct impact on the daily lives of millions of people while gravitational waves reveal insight into the composition and history of the Universe. These physical phenomena, despite being tackled traditionally by different fields of physics, have in common that they are modelled the same way mathematically: as a system of hyperbolic partial differential equations (PDEs). The ExaHyPE project (“An Exascale Hyperbolic PDE Engine") translates this similarity into a software engine that can be quickly adapted to simulate a wide range of hyperbolic partial differential equations. ExaHyPE’s key idea is that the user only specifies the physics while the engine takes care of the parallelisation and the interplay of the underlying numerical methods. Consequently, a first simulation code for a new hyperbolic PDE can often be realised within a few hours. This is a task that traditionally can take weeks, months, even years for researchers starting from scratch. My main contribution to ExaHyPE is the development of the core infrastructure. This comprises the development and implementation of ExaHyPE’s solvers and adaptive mesh refinement procedures, it’s MPI+X parallelisation as well as high-level aspects of ExaHyPE’s application-tailored code generation, which allows to adapt ExaHyPE to model many different hyperbolic PDE systems. Like any high-performance computing code, ExaHyPE has to tackle the challenges of the coming exascale computing era, notably network communication latencies and the growing memory wall. In this thesis, I propose memory-efficient realisations of ExaHyPE’s solvers that avoid data movement together with a novel task-based MPI+X parallelisation concept that allows to hide network communication behind computation in dynamically adaptive simulations

    teaMPI---replication-based resiliency without the (performance) pain.

    Get PDF
    In an era where we can not afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet, replication often is not employed on larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine—a task-based solver for hyperbolic equation systems—that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned “for nothing”. Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing

    Leveraging performance of 3D finite difference schemes in large scientific computing simulations

    Get PDF
    Gone are the days when engineers and scientists conducted most of their experiments empirically. During these decades, actual tests were carried out in order to assess the robustness and reliability of forthcoming product designs and prove theoretical models. With the advent of the computational era, scientific computing has definetely become a feasible solution compared with empirical methods, in terms of effort, cost and reliability. Large and massively parallel computational resources have reduced the simulation execution times and have improved their numerical results due to the refinement of the sampled domain. Several numerical methods coexist for solving the Partial Differential Equations (PDEs). Methods such as the Finite Element (FE) and the Finite Volume (FV) are specially well suited for dealing with problems where unstructured meshes are frequent. Unfortunately, this flexibility is not bestowed for free. These schemes entail higher memory latencies due to the handling of irregular data accesses. Conversely, the Finite Difference (FD) scheme has shown to be an efficient solution for problems where the structured meshes suit the domain requirements. Many scientific areas use this scheme due to its higher performance. This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed such as the Semi-stencil, a novel algorithm that increases the FLOP/Byte ratio for medium- and high-order stencils operators by reducing the accesses and endorsing data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial- or time-blocking, adding further improvement. New trends on Symmetric Multi-Processing (SMP) systems -where tens of cores are replicated on the same die- pose new challenges due to the exacerbation of the memory wall problem. In order to alleviate this issue, our research is focused on different strategies to reduce pressure on the cache hierarchy, particularly when different threads are sharing resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for work-load balance are introduced ensuring quasi-optimal results without jeopardizing the overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in last level cache. As alternative to brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a suboptimal candidate, performance models are a feasible solution. Performance models can predict the performance on different architectures, selecting suboptimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model is capable of supporting multi- and many-core architectures including complex features such as hardware prefetchers, SMT context and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment. Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects arising in industry are still scarcely considered in academia research. In this regard, we have collaborated in the implementation of a FD framework which covers the most important features that an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been included into the framework in order to contribute in the overall application performance. We show results for a couple of strategic applications in industry: an atmospheric transport model that simulates the dispersal of volcanic ash and a seismic imaging model used in Oil & Gas industry to identify hydrocarbon-rich reservoirs.Atrás quedaron los días en los que ingenieros y científicos realizaban sus experimentos empíricamente. Durante esas décadas, se llevaban a cabo ensayos reales para verificar la robustez y fiabilidad de productos venideros y probar modelos teóricos. Con la llegada de la era computacional, la computación científica se ha convertido en una solución factible comparada con métodos empíricos, en términos de esfuerzo, coste y fiabilidad. Los supercomputadores han reducido el tiempo de las simulaciones y han mejorado los resultados numéricos gracias al refinamiento del dominio. Diversos métodos numéricos coexisten para resolver las Ecuaciones Diferenciales Parciales (EDPs). Métodos como Elementos Finitos (EF) y Volúmenes Finitos (VF) están bien adaptados para tratar problemas donde las mallas no estructuradas son frecuentes. Desafortunadamente, esta flexibilidad no se confiere de forma gratuita. Estos esquemas conllevan latencias más altas debido al acceso irregular de datos. En cambio, el esquema de Diferencias Finitas (DF) ha demostrado ser una solución eficiente cuando las mallas estructuradas se adaptan a los requerimientos. Esta tesis se enfoca en mejorar los esquemas DF para impulsar el rendimiento de las simulaciones en la computación científica. Se proponen diferentes técnicas, como el Semi-stencil, un nuevo algoritmo que incrementa el ratio de FLOP/Byte para operadores de stencil de orden medio y alto reduciendo los accesos y promoviendo el reuso de datos. El algoritmo es ortogonal y puede ser combinado con técnicas como spatial- o time-blocking, añadiendo mejoras adicionales. Las nuevas tendencias hacia sistemas con procesadores multi-simétricos (SMP) -donde decenas de cores son replicados en el mismo procesador- plantean nuevos retos debido a la exacerbación del problema del ancho de memoria. Para paliar este problema, nuestra investigación se centra en estrategias para reducir la presión en la jerarquía de cache, particularmente cuando diversos threads comparten recursos debido a Simultaneous Multi-Threading (SMT). Introducimos diversos planificadores de descomposición de dominios para balancear la carga asegurando resultados casi óptimos sin poner en riesgo el rendimiento global. Combinamos estos planificadores con técnicas de spatial-blocking y auto-tuning, explorando el espacio paramétrico y reduciendo los fallos en la cache de último nivel. Como alternativa a los métodos de fuerza bruta usados en auto-tuning donde un espacio paramétrico se debe recorrer para encontrar un candidato, los modelos de rendimiento son una solución factible. Los modelos de rendimiento pueden predecir el rendimiento en diferentes arquitecturas, seleccionando parámetros suboptimos casi de forma instantánea. En esta tesis, ideamos un modelo de rendimiento para stencils flexible y extensible. El modelo es capaz de soportar arquitecturas multi-core incluyendo características complejas como prefetchers, SMT y optimizaciones algorítmicas. Nuestro modelo puede ser usado no solo para predecir los tiempos de ejecución, sino también para tomar decisiones de los mejores parámetros algorítmicos. Además, puede ser incluido en optimizadores run-time para decidir la mejor configuración SMT. Algunas industrias confían en técnicas DF para sus códigos. Sin embargo, no todos los aspectos que aparecen en la industria han sido sometidos a investigación. En este aspecto, hemos diseñado e implementado desde cero una infraestructura DF que cubre las características más importantes que una aplicación industrial debe incluir. Algunas de las técnicas de optimización propuestas en esta tesis han sido incluidas para contribuir en el rendimiento global a nivel industrial. Mostramos resultados de un par de aplicaciones estratégicas para la industria: un modelo de transporte atmosférico que simula la dispersión de ceniza volcánica y un modelo de imagen sísmica usado en la industria del petroleo y gas para identificar reservas ricas en hidrocarburo

    ISCR Annual Report: Fical Year 2004

    Full text link

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest

    Efficient computations for multiphase flow problems using coupled lattice Boltzmann-level set methods

    Get PDF
    Multiphase flow simulations benefit a variety of applications in science and engineering as for example in the dynamics of bubble swarms in heat exchangers and chemical reactors or in the prediction of the effects of droplet or bubble impacts in the design of turbomachinery systems. Despite all the progress in the modern computational fluid dynamics (CFD), such simulations still present formidable challenges both from numerical and computational cost point of view. Emerging as a powerful numerical technique in recent years, the lattice Boltzmann method (LBM) exhibits unique numerical and computational features in specific problems for its ability to detect small scale transport phenomena, including those of interparticle forces in multiphase and multicomponent flows, as well as its inherent advantage to deliver favourable computational efficiencies on parallel processors. In this thesis two classes of LB methods for multiphase flow simulations are developed which are coupled with the level set (LS) interface capturing technique. Both techniques are demonstrated to provide high resolution realizations of the interface at large density and viscosity differences within relatively low computational demand and regularity restrictions compared to the conventional phase-field LB models. The first model represents a sharp interface one-fluid formulation, where the LB equation is assigned to solve for a single virtual fluid and the interface is captured through convection of an initially signed distance level set function governed by the level set equation (LSE). The second scheme proposes a diffuse pressure evolution description of the LBE, solving for velocity and dynamic pressure only. In contrast to the common kineticbased solutions of the Cahn-Hilliard equations, the density is then solved via a mass conserving LS equation which benefits from a fast monolithic reinitialization. Rigorous comparisons against established numerical solutions of multiphase NS equations for rising bubble problems are carried out in two and three dimensions, which further provide an unprecedented basis to evaluate the effect of different numerical and implementation aspects of the schemes on the overall performance and accuracy. The simulations are eventually applied to other physically interesting multiphase problems, featuring the flexibility and stability of the scheme under high Re numbers and very large deformations. On the computational side, an efficient implementation of the proposed schemes is presented in particular for manycore general purpose graphical processing units (GPGPU). Various segments of the solution algorithm are then evaluated with respect to their corresponding computational workload and efficient implementation outlines are addressed

    Matrix-free finite-element computations at extreme scale and for challenging applications

    Get PDF
    For numerical computations based on finite element methods (FEM), it is common practice to assemble the system matrix related to the discretized system and to pass this matrix to an iterative solver. However, the assembly step can be costly and the matrix might become locally dense, e.g., in the context of high-order, high-dimensional, or strongly coupled multicomponent FEM, leading to high costs when applying the matrix due to limited bandwidth on modern CPU- and GPU-based hardware. Matrix-free algorithms are a means of accelerating FEM computations on HPC systems, by applying the effect of the system matrix without assembling it. Despite convincing arguments for matrix-free computations as a means of improving performance, their usage still tends to be an exception at the time of writing of this thesis, not least because they have not yet proven their applicability in all areas of computational science, e.g., solid mechanics. In this thesis, we further develop a state-of-the-art matrix-free framework for high-order FEM computations with focus on the preconditioning and adopt it in novel application fields. In the context of high-order FEM, we develop means of improving cache efficiency by interleaving cell loops with vector updates, which we use to increase the throughput of preconditioned conjugate gradient methods and of block smoothers based on additive Schwarz methods; we also propose an algorithm for the fast application of hanging-node constraints in 3D for up to 137 refinement configurations. We develop efficient geometric and polynomial multigrid solvers with optimized transfer operators, whose performance is experimentally investigated in detail in the context of locally refined meshes, indicating the superiority of global-coarsening algorithms. We apply the developed solvers in the context of novel stage-parallel implicit Runge–Kutta methods and demonstrate the benefit of stage–parallel solvers in decreasing the time to solution at the scaling limit. Novel challenging application fields of matrix-free computations include high-dimensional computational plasma physics, solid-state-sintering simulations with a high and dynamically changing number of strongly coupled components, and coupled multiphysics problems with evaluation and integration at arbitrary points. In the context of these fields, we detail computational challenges, propose modified versions of the standard matrix-free algorithms for high-performance computing, and discuss preconditioning-related topics. The efficiency of the derived algorithms on the node level and at extreme scales is demonstrated experimentally on SuperMUC-NG, one of Germany’s leading supercomputers, with up to 150k processes and by solving systems of up to 5 × 1012 unknowns. Such problem sizes would not be conceivable for equivalent matrix-based algorithms. The major achievements of this thesis allow to run larger simulations faster and more efficiently, enabling progress and new possibilities for a range of application fields in computational science

    A multiresolution Discrete Element Method for triangulated objects with implicit time stepping

    Get PDF
    Simulations of many rigid bodies colliding with each other sometimes yield particularly interesting results if the colliding objects differ significantly in size and are nonspherical. The most expensive part within such a simulation code is the collision detection. We propose a family of novel multiscale collision detection algorithms that can be applied to triangulated objects within explicit and implicit time stepping methods. They are well suited to handle objects that cannot be represented by analytical shapes or assemblies of analytical objects. Inspired by multigrid methods and adaptive mesh refinement, we determine collision points iteratively over a resolution hierarchy and combine a functional minimization plus penalty parameters with the actual comparision-based geometric distance calculation. Coarse surrogate geometry representations identify “no collision” scenarios early on and otherwise yield an educated guess which triangle subsets of the next finer level might yield collisions. They prune the search tree and furthermore feed conservative contact force estimates into the iterative solve behind an implicit time stepping. Implicit time stepping and nonanalytical shapes often yield prohibitive high compute cost for rigid body simulations. Our approach reduces the object-object comparison cost algorithmically by one to two orders of magnitude. It also exhibits high vectorization efficiency due to its iterative nature
    corecore