2 research outputs found

    High Performance Fault-Tolerant Solution of PDEs using the Sparse Grid Combination Technique

    The data volume of Partial Differential Equation (PDE) based ultra-large-scale scientific simulations is increasing at a higher rate than the system’s processing power. To process the increased amount of simulation data within a reasonable amount of time, computation is expected to reach the exascale level. One of several key challenges in these exascale systems is handling the high rate of component failure that arises when millions of cores work together at high power consumption and clock frequencies. Studies show that even highly tuned, widely used checkpointing techniques cannot handle failures efficiently in exascale systems. The Sparse Grid Combination Technique (SGCT) has proved to be a cost-effective method for computing high-dimensional PDE based simulations with only a small loss of accuracy, and it can be easily modified to provide Algorithm-Based Fault Tolerance (ABFT) for these applications. Additionally, the recently introduced User Level Failure Mitigation (ULFM) MPI library provides the ability to detect and identify application process failures, and to reconstruct the failed processes. However, there is a gap in the research on how these could be integrated to develop fault-tolerant applications, and the range of issues that may arise in the process is yet to be revealed. My thesis is that, with suitable infrastructural support, an integration of ULFM MPI and a modified form of the SGCT can be used to create high performance, robust PDE based applications. The key contributions of my thesis are: (1) An evaluation of the effectiveness of applying the modified version of the SGCT to three existing and complex applications (including a general advection solver) to make them highly fault-tolerant. (2) An evaluation of the capabilities of ULFM MPI to recover from single or multiple real process/node failures for a range of complex applications computed with the modified form of the SGCT. (3) A detailed experimental evaluation of the fault-tolerant work, including the time and space requirements, and parallelization on the non-SGCT dimensions. (4) An analysis of the result errors with respect to the number of failures. (5) An analysis of the ABFT and recovery overheads. (6) An in-depth comparison of the fault-tolerant SGCT based ABFT with traditional checkpointing on a non-fault-tolerant SGCT based application. (7) A detailed evaluation of the infrastructural support in terms of load balancing, pure- and hybrid-MPI, process layouts, processor affinity, and so on.

    A massively parallel combination technique for the solution of high-dimensional PDEs

    The solution of high-dimensional problems, especially high-dimensional partial differential equations (PDEs) that require the joint discretization of more than the usual three spatial dimensions and time, is one of the grand challenges in high performance computing (HPC). Due to the exponential growth of the number of unknowns (the so-called curse of dimensionality), it is in many cases not feasible to resolve the simulation domain as finely as the physical problem requires. Although the upcoming generation of exascale HPC systems theoretically provides the computational power to handle simulations that are out of reach today, it is expected that this is only achievable with new numerical algorithms that can efficiently exploit the massive parallelism of these systems. The sparse grid combination technique is a numerical scheme in which the problem (e.g., a high-dimensional PDE) is solved on different coarse and anisotropic computational grids (so-called component grids), which are then combined to approximate the solution at a much higher target resolution than that of any individual component grid. In this way, the total number of unknowns being computed is drastically reduced compared to solving the problem directly on a regular grid with the target resolution. Thus, the curse of dimensionality is mitigated. The combination technique is a promising approach for solving high-dimensional problems on future exascale systems. It offers two levels of parallelism: the component grids can be computed in parallel, independently and asynchronously of each other; and the computation of each component grid can be parallelized as well. This reduces the demand for global communication and synchronization, which is expected to be one of the limiting factors for the scalability of classical discretization techniques on exascale systems. Furthermore, the combination technique enables novel approaches to dealing with the increasing fault rates expected from these systems. With the fault-tolerant combination technique, it is possible to recover from failures without time-consuming checkpoint-restart mechanisms. In this work, new algorithms and data structures are presented that enable a massively parallel and fault-tolerant combination technique for time-dependent PDEs on large-scale HPC systems. The scalability of these algorithms is demonstrated on up to 180225 processor cores on the supercomputer Hazel Hen. Furthermore, the parallel combination technique is applied to gyrokinetic simulations in GENE, a software package for the simulation of plasma microturbulence in fusion devices.
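
The combination idea described in this abstract can be illustrated with a small sketch. It is not taken from either thesis. It assumes the textbook setting in which component grids are indexed by level vectors with entries of at least 1, a grid with level `l_i` in dimension `i` has `2**l_i + 1` points in that dimension, and combination coefficients are obtained by inclusion-exclusion over a downward-closed index set. All function names (`index_set`, `combination_coefficients`, `drop_failed`, `points`) are hypothetical. Dropping a failed grid and recomputing the coefficients is shown as one simple way to recover without a checkpoint; the fault-tolerant schemes in the works above may differ in detail.

```python
from itertools import product
from math import prod


def index_set(dim, level):
    """Level vectors l with l_i >= 1 and l_1 + ... + l_d <= level + dim - 1."""
    bound = level + dim - 1
    return {l for l in product(range(1, level + 1), repeat=dim) if sum(l) <= bound}


def combination_coefficients(indices):
    """Inclusion-exclusion coefficients for a downward-closed index set.

    c_l = sum over z in {0,1}^d of (-1)^|z| * [l + z is in the set].
    Grids whose coefficient is zero are omitted from the result.
    """
    dim = len(next(iter(indices)))
    coeffs = {}
    for l in indices:
        c = 0
        for z in product((0, 1), repeat=dim):
            neighbour = tuple(a + b for a, b in zip(l, z))
            if neighbour in indices:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[l] = c
    return coeffs


def drop_failed(indices, failed):
    """Remove a failed grid and every grid that dominates it component-wise.

    This keeps the index set downward closed, so the coefficients can be
    recomputed from the surviving grids alone (assumed recovery strategy).
    """
    return {l for l in indices if not all(a >= b for a, b in zip(l, failed))}


def points(l):
    """Number of points of a component grid with 2**l_i + 1 points per dimension."""
    return prod(2 ** li + 1 for li in l)


if __name__ == "__main__":
    dim, level = 2, 10
    coeffs = combination_coefficients(index_set(dim, level))
    combined = sum(points(l) for l in coeffs)      # unknowns actually computed
    full = (2 ** level + 1) ** dim                 # regular grid at target resolution
    print(f"component grids: {len(coeffs)}, unknowns: {combined}, full grid: {full}")

    # Simulate the loss of one component grid and recombine without it.
    survivors = drop_failed(index_set(2, 3), failed=(2, 2))
    print(combination_coefficients(survivors))
```

In this sketch the coefficients always sum to 1, so the combined result remains a consistent approximation after a grid is lost, though at a lower accuracy than the full combination. For very small levels the component grids together can hold more points than the full grid; the savings appear as the level and the dimension grow.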