6 research outputs found
Lazy Checkpoint Coordination for Bounding Rollback Propagation
Coordinated Science Laboratory was formerly known as Control Systems LaboratoryNational Aeronautics and Space Administration / NASA NAG 1-613Department of the Navy managed by the Office of the Chief of Naval Research / N00014-91-J-128
SpECTRE: A Task-based Discontinuous Galerkin Code for Relativistic Astrophysics
We introduce a new relativistic astrophysics code, SpECTRE, that combines a
discontinuous Galerkin method with a task-based parallelism model. SpECTRE's
goal is to achieve more accurate solutions for challenging relativistic
astrophysics problems such as core-collapse supernovae and binary neutron star
mergers. The robustness of the discontinuous Galerkin method allows for the use
of high-resolution shock capturing methods in regions where (relativistic)
shocks are found, while exploiting high-order accuracy in smooth regions. A
task-based parallelism model allows efficient use of the largest supercomputers
for problems with a heterogeneous workload over disparate spatial and temporal
scales. We argue that the locality and algorithmic structure of discontinuous
Galerkin methods will exhibit good scalability within a task-based parallelism
framework. We demonstrate the code on a wide variety of challenging benchmark
problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the
code's scalability including its strong scaling on the NCSA Blue Waters
supercomputer up to the machine's full capacity of 22,380 nodes using 671,400
threads.Comment: 41 pages, 13 figures, and 7 tables. Ancillary data contains
simulation input file
Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems
Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called recovery line. If the processes are allowed to take uncoordinated checkpoints, the above rollback propagation may result in the domino effect which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated