Adaptive control in rollforward recovery for extreme scale multigrid
With the increasing number of compute components, failures in future
exascale computer systems are expected to become more frequent. This motivates
the study of novel resilience techniques. Here, we extend a recently proposed
algorithm-based recovery method for multigrid iterations by introducing an
adaptive control. After a fault, the healthy part of the system continues the
iterative solution process, while the solution in the faulty domain is
reconstructed by an asynchronous online recovery. The computations in both
the faulty and healthy subdomains must be coordinated carefully; in
particular, both under-solving and over-solving must be avoided, since either
wastes computational resources and therefore increases the overall
time-to-solution. To control the local recovery and guarantee an optimal
re-coupling, we introduce a stopping criterion based on a mathematical error
estimator. It involves hierarchically weighted sums of residuals on uniformly
refined meshes and is well suited to parallel high-performance
computing. The re-coupling process is steered by
local contributions of the error estimator. We propose and compare two criteria
which differ in their weights. Failure scenarios when solving up to
unknowns on more than 245,766 parallel processes are
reported on a state-of-the-art petascale supercomputer, demonstrating the
robustness of the method.
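A minimal sketch of how such an estimator-based stopping criterion could steer re-coupling, assuming a hierarchically weighted sum of per-level residual norms; the weights, names, and the comparison against the healthy domain are illustrative, not the paper's exact estimator:

```cpp
#include <cmath>
#include <vector>

// One weighted residual contribution per multigrid level.
struct LevelResidual {
    double norm2;   // squared norm of the residual on this level
    double weight;  // hierarchical weight, e.g. scaling with the mesh size
};

// Weighted sum of per-level residual contributions on one subdomain.
double localErrorEstimate(const std::vector<LevelResidual>& levels) {
    double sum = 0.0;
    for (const auto& l : levels) sum += l.weight * l.norm2;
    return std::sqrt(sum);
}

// Re-couple the faulty subdomain once its local estimate is no longer the
// dominant error source: stopping earlier would under-solve, stopping later
// would over-solve, and both waste time-to-solution.
bool readyToRecouple(const std::vector<LevelResidual>& faulty,
                     const std::vector<LevelResidual>& healthy,
                     double safetyFactor = 1.0) {
    return localErrorEstimate(faulty) <= safetyFactor * localErrorEstimate(healthy);
}
```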
Extending the OpenCHK Model with Advanced Checkpoint Features
One of the major challenges in using extreme scale systems efficiently is to
mitigate the impact of faults. Application-level checkpoint/restart (CR)
methods provide the best trade-off between productivity, robustness, and
performance. There are many solutions implementing CR at the application level.
They all provide advanced I/O capabilities to minimize the overhead introduced
by CR. Nevertheless, there is still room for improvement in terms of
programmability and flexibility, because end-users must manually serialize and
deserialize application state using low-level APIs, modify the flow of the
application to consider restarts, or rewrite CR code whenever the backend
library changes. In this work, we propose a set of compiler directives and
clauses that allow users to specify CR operations in a simple way. Our approach
supports the common CR features provided by all the CR libraries. However, it
can also be extended to support advanced features that are only available in
some CR libraries, such as differential checkpointing, the use of HDF5 format,
and the possibility of using fault-tolerance-dedicated threads. Our
evaluation revealed a substantial improvement in programmability. On average, we
reduced the number of lines of code by 71%, 94%, and 64% for FTI, SCR, and
VeloC, respectively, and no additional overhead was perceived using our
solution compared to using the backend libraries directly. Finally, portability
is enhanced because our programming model allows the use of any backend library
without changing any code.
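A hedged sketch of what directive-based checkpoint/restart in this spirit can look like; the pragma and clause names below are assumptions for illustration and may not match the exact OpenCHK syntax:

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int N = 1 << 20;
    static double state[1 << 20];
    int iter = 0;

    // Initialize the CR backend (FTI, SCR, VeloC, ...) behind the directives.
    #pragma chk init comm(MPI_COMM_WORLD)

    // On restart, reload the protected data instead of recomputing it.
    #pragma chk load(state[0:N], iter)

    for (; iter < 10000; ++iter) {
        state[iter % N] += 1.0;  // stand-in for the real computation

        // Checkpoint every 100 iterations; serialization, I/O, and advanced
        // features (differential checkpoints, dedicated threads, HDF5) stay
        // inside the backend, not in the application code.
        #pragma chk store(state[0:N], iter) id(iter) if(iter % 100 == 0)
    }

    #pragma chk shutdown
    MPI_Finalize();
    return 0;
}
```

Swapping one backend library for another would then be a build-time choice rather than a rewrite, which is where the portability claim comes from.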
ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms
Fault-tolerant distributed applications require mechanisms to recover data
lost via a process failure. On modern cluster systems it is typically
impractical to request replacement resources after such a failure. Therefore,
applications have to continue working with the remaining resources. This
requires redistributing the workload and having the non-failed processes
reload the lost data. We present an algorithmic framework and its C++ library implementation
ReStore for MPI programs that enables recovery of data after process failures.
Because all required data is stored in memory via an appropriate data distribution and
replication, recovery is substantially faster than with standard checkpointing
schemes that rely on a parallel file system. As the application developer can
specify which data to load, we also support shrinking recovery instead of
recovery using spare compute nodes. We evaluate ReStore in both controlled,
isolated environments and real applications. Our experiments show loading times
of lost input data in the range of milliseconds on up to 24,576 processors and a
substantial speedup of the recovery time for the fault-tolerant version of a
widely used bioinformatics application.
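An illustrative sketch of the in-memory replication idea, with hypothetical class and method names rather than ReStore's real C++ API:

```cpp
#include <cstdint>
#include <map>
#include <vector>

template <typename Block>
class ReplicatedStore {
    // Replicas held in this rank's memory. In a real MPI implementation,
    // submit() would also push copies of each block to k-1 other ranks, so
    // that any k-1 simultaneous process failures leave one copy alive.
    std::map<std::uint64_t, Block> local_;
public:
    void submit(std::uint64_t id, const Block& b) { local_[id] = b; }

    // After a failure, surviving ranks serve the requested block ids straight
    // from memory: keeping the parallel file system off the recovery path is
    // what makes millisecond-scale reloads possible.
    std::vector<Block> request(const std::vector<std::uint64_t>& ids) const {
        std::vector<Block> out;
        out.reserve(ids.size());
        for (auto id : ids) out.push_back(local_.at(id));
        return out;
    }
};
```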
Keeping checkpoint/restart viable for exascale systems
Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches across a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
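A minimal CPU-only sketch of the hash-based incremental checkpointing idea (the work itself offloads hashing to GPUs): state is split into pages, and only pages whose hash changed since the last checkpoint are written. The page size and hash function are arbitrary choices here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kPage = 4096;  // arbitrary page size for this sketch

// FNV-1a, standing in for whatever hash the GPU kernels would compute.
std::uint64_t fnv1a(const std::uint8_t* p, std::size_t n) {
    std::uint64_t h = 1469598103934665603ull;
    for (std::size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

// Returns the indices of pages that changed since the last checkpoint and
// updates the stored hashes; only these pages need to be committed. The
// scheme is probabilistic because a hash collision could mask a real change.
std::vector<std::size_t> dirtyPages(const std::uint8_t* state, std::size_t bytes,
                                    std::vector<std::uint64_t>& lastHashes) {
    const std::size_t pages = (bytes + kPage - 1) / kPage;
    lastHashes.resize(pages, 0);
    std::vector<std::size_t> dirty;
    for (std::size_t i = 0; i < pages; ++i) {
        const std::size_t len = std::min(kPage, bytes - i * kPage);
        const std::uint64_t h = fnv1a(state + i * kPage, len);
        if (h != lastHashes[i]) { dirty.push_back(i); lastHashes[i] = h; }
    }
    return dirty;
}
```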
Towards a Mini-App for Smoothed Particle Hydrodynamics at Exascale
The smoothed particle hydrodynamics (SPH) technique is a purely Lagrangian
method, used in numerical simulations of fluids in astrophysics and
computational fluid dynamics, among many other fields. SPH simulations with
detailed physics represent computationally demanding calculations. The
parallelization of SPH codes is not trivial due to the absence of a structured
grid. Additionally, the performance of the SPH codes can be, in general,
adversely impacted by several factors, such as multiple time-stepping,
long-range interactions, and/or boundary conditions. This work presents
insights into the current performance and functionalities of three SPH codes:
SPHYNX, ChaNGa, and SPH-flow. These codes are the starting point of an
interdisciplinary co-design project, SPH-EXA, for the development of an
Exascale-ready SPH mini-app. To gain such insights, a rotating square patch
test was implemented as a common test simulation for the three SPH codes and
analyzed on two modern HPC systems. Furthermore, to stress the differences with
the codes stemming from the astrophysics community (SPHYNX and ChaNGa), an
additional test case, the Evrard collapse, has also been carried out. This work
extrapolates the common basic SPH features in the three codes for the purpose
of consolidating them into a pure-SPH, Exascale-ready, optimized, mini-app.
Moreover, the outcome of this work serves as direct feedback to the parent
codes, to improve their performance and overall scalability.
Comment: 18 pages, 4 figures, 5 tables, 2018 IEEE International Conference on
Cluster Computing proceedings for WRAp1
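For orientation, a textbook SPH density summation with a cubic-spline kernel; this is the generic method, not code from SPHYNX, ChaNGa, or SPH-flow:

```cpp
#include <cmath>
#include <vector>

struct Particle { double x, y, z, m, rho; };

// Standard 3D cubic-spline kernel with compact support of radius 2h.
double W(double r, double h) {
    const double pi = 3.14159265358979323846;
    const double q = r / h, s = 1.0 / (pi * h * h * h);
    if (q < 1.0) return s * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
    if (q < 2.0) return s * 0.25 * (2.0 - q) * (2.0 - q) * (2.0 - q);
    return 0.0;
}

// Density summation: rho_i = sum_j m_j W(|r_i - r_j|, h). The O(N^2) loop is
// what production codes replace with trees or cell lists; with no structured
// grid, distributing this neighbor search is the hard parallelization problem.
void computeDensity(std::vector<Particle>& ps, double h) {
    for (auto& pi : ps) {
        pi.rho = 0.0;
        for (const auto& pj : ps) {
            const double dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
            pi.rho += pj.m * W(std::sqrt(dx * dx + dy * dy + dz * dz), h);
        }
    }
}
```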
Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes
This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing. The final authenticated version is available online at: https://doi.org/10.1007/s00354-013-0302-4
The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures to ensure that the computation already done is not lost when a machine fails. Checkpointing and rollback recovery is one of the most popular techniques to implement fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization, and storage resources. Thus, current checkpoint-recovery techniques should minimize these costs in order to be useful for large-scale systems. In this paper, three different and complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing are proposed and implemented. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods to reduce checkpointing cost.
Ministerio de Ciencia e Innovación; TIN2010-16735. Galicia. Consellería de Economía e Industria; 10PXIB105180P.
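As a generic illustration of checkpoint-size reduction (the paper's three concrete techniques are not reproduced here), one simple approach is to record only the non-zero extents of the serialized application state:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::size_t kBlock = 4096;  // arbitrary block size for this sketch

// Scan the serialized application state and return (offset, length) extents
// that contain non-zero bytes; all-zero blocks are omitted from the
// checkpoint and simply re-zeroed on restart.
std::vector<std::pair<std::size_t, std::size_t>>
nonZeroExtents(const std::uint8_t* data, std::size_t bytes) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t off = 0; off < bytes; off += kBlock) {
        const std::size_t len = std::min(kBlock, bytes - off);
        bool allZero = true;
        for (std::size_t i = 0; i < len && allZero; ++i)
            allZero = (data[off + i] == 0);
        if (!allZero) out.emplace_back(off, len);  // worth storing
    }
    return out;
}
```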
An Optimizing Java Translation Framework for Automated Checkpointing and Strong Mobility
Long-running programs, e.g., in high-performance computing, need to
write periodic checkpoints of their execution state to disk to allow
them to recover from node failure. Manually adding checkpointing code
to an application, however, is very tedious. The mechanisms needed
for writing the execution state of a program to disk and restoring it
are similar to those needed for migrating a running thread or a mobile
object. We have extended a source-to-source translation scheme that
allows the migration of mobile Java objects with running threads to
make it more general and allow it to be used for automated
checkpointing. Our translation scheme allows serializable threads to
be written to disk or migrated with a mobile agent to a remote
machine. The translator generates code that maintains a serializable
run-time stack for each thread as a Java data structure. While this
results in significant run-time overhead, it allows the checkpointing
code to be generated automatically. We improved the locking mechanism
that is needed to protect the run-time stack as well as the translation
scheme. Our experimental results demonstrate a speedup of the
generated code over the original translator and show that the approach
is feasible in practice.
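A sketch of the underlying idea, in C++ for consistency with the other examples: the translated program maintains an explicit, serializable shadow stack instead of relying on the native call stack. The frame layout is illustrative; the paper's translator generates the equivalent structures in Java.

```cpp
#include <cstdint>
#include <ostream>
#include <string>
#include <vector>

// One frame of the explicit "shadow" stack the translated code maintains.
struct Frame {
    std::string method;               // which method this frame belongs to
    std::uint32_t pc;                 // resume point inside that method
    std::vector<std::int64_t> locals; // captured local variables
};

// Writing the shadow stack is a checkpoint; reading it back and dispatching
// on (method, pc) resumes the computation mid-call-chain, which is also what
// makes strong thread migration possible.
void checkpoint(const std::vector<Frame>& stack, std::ostream& out) {
    const std::uint64_t n = stack.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (const auto& f : stack) {
        const std::uint64_t len = f.method.size(), m = f.locals.size();
        out.write(reinterpret_cast<const char*>(&len), sizeof len);
        out.write(f.method.data(), static_cast<std::streamsize>(len));
        out.write(reinterpret_cast<const char*>(&f.pc), sizeof f.pc);
        out.write(reinterpret_cast<const char*>(&m), sizeof m);
        out.write(reinterpret_cast<const char*>(f.locals.data()),
                  static_cast<std::streamsize>(m * sizeof(std::int64_t)));
    }
}
```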
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactive configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small, and the maximum work lost by a crashed process is small and bounded.
Index Terms: Grid computing, rollback recovery, checkpointing, event logging.
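A minimal sketch of the systematic event logging idea: nondeterministic events are recorded during normal execution and replayed during recovery so that re-execution is deterministic. The event contents here (work-stealing outcomes) are an assumption for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// A logged nondeterministic event; here, the outcome of a work-stealing
// decision (an assumed event kind, chosen to match the theft-induced setting).
struct Event { std::uint64_t seq; std::uint32_t thiefRank, taskId; };

class EventLog {
    std::vector<Event> log_;
    std::size_t replayPos_ = 0;
public:
    // Normal execution: append each nondeterministic outcome as it happens.
    void record(const Event& e) { log_.push_back(e); }

    // Recovery: force the logged outcome instead of re-running the
    // nondeterministic choice, so only the crashed process re-executes and
    // the others never roll back.
    std::optional<Event> replay() {
        if (replayPos_ >= log_.size()) return std::nullopt;
        return log_[replayPos_++];
    }
};
```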