Havens: Explicit Reliable Memory Regions for HPC Applications
Supporting error resilience in future exascale-class supercomputing systems
is a critical challenge. Due to transistor scaling trends and increasing memory
density, scientific simulations are expected to experience more interruptions
caused by transient errors in system memory. Existing hardware-based
detection and recovery techniques will be inadequate in the presence of
such high memory fault rates.
In this paper, we propose a partial memory protection scheme based on
region-based memory management. We define regions, called havens, that
provide fault protection for program objects. We provide reliability for
the regions through a software-based parity protection mechanism. Our approach
enables critical program objects to be placed in these havens. The fault
coverage provided by our approach is application-agnostic, unlike
algorithm-based fault tolerance techniques.
Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, USA
May 9th, 2017
Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and on technology roadmaps suggest that future systems will experience very high fault rates. While the HPC community has developed various resilience solutions, both application-level techniques and system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods or metrics to investigate and evaluate resilience holistically in HPC systems in terms of impact scope, handling coverage, and performance and power efficiency. Few of the current approaches are portable to the newer architectures and software environments that will be deployed on future systems.
In this talk, I will present a structured approach to the management of HPC resilience based on the concept of resilience design patterns. A design pattern is a general, repeatable solution to a commonly occurring problem. We identify the problems that commonly arise from faults, errors, and failures in HPC systems, together with the solutions used to deal with them. Each established solution is described in the form of a pattern that addresses concrete problems. We have developed a complete catalog of resilience design patterns, which provides designers with a collection of such reusable design elements. We have also defined a framework that clarifies the important constraints and opportunities for implementing and deploying the design patterns at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces that coordinate flexible fault management across hardware and software components, and it enables optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems.
Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery
Efficient utilization of today's high-performance computing (HPC)
systems, with their complex hardware and software components, requires
that HPC applications be designed to tolerate process failures at
runtime. Given the low mean time to failure (MTTF) of current and future
HPC systems, long-running simulations require that the applications
themselves handle process failures gracefully. In this paper, we explore the
use of fault tolerance extensions to Message Passing Interface (MPI) called
user-level failure mitigation (ULFM) for handling process failures without the
need to discard the progress made by the application. We explore two
alternative recovery strategies, which use ULFM along with application-driven
in-memory checkpointing. In the first case, the application is recovered with
only the surviving processes, and in the second case, spares are used to
replace the failed processes, such that the original configuration of the
application is restored. Our experimental results demonstrate that graceful
degradation is a viable alternative for recovery in environments where spares
may not be available.
Comment: 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2018)