Is the Multigrid Method Fault Tolerant? The Multilevel Case
Computing at the exascale level is expected to be affected by a significantly
higher rate of faults, due to increased component counts as well as power
considerations. Therefore, current day numerical algorithms need to be
reexamined to determine whether they are fault resilient, and which critical
operations need to be safeguarded in order to obtain performance that is close
to the ideal fault-free method.
In a previous paper, a framework for the analysis of random stationary linear
iterations was presented and applied to the two grid method. The present work
is concerned with the multigrid algorithm for the solution of linear systems of
equations, which is widely used on high performance computing systems. It is
shown that the Fault-Prone Multigrid Method is not resilient, unless the
prolongation operation is protected. Strategies for fault detection and
mitigation as well as protection of the prolongation operation are presented
and tested, and a guideline for an optimal choice of parameters is devised.
Comment: 25 pages, 10 figures
Is the Multigrid Method Fault Tolerant? The Two-Grid Case
The predicted reduced resiliency of next-generation high performance
computers means that it will become necessary to take into account the effects
of randomly occurring faults on numerical methods. Further, in the event of a
hard fault occurring, a decision has to be made as to what remedial action
should be taken in order to resume the execution of the algorithm. The action
that is chosen can have a dramatic effect on the performance and
characteristics of the scheme. Ideally, the resulting algorithm should be
subjected to the same kind of mathematical analysis that was applied to the
original, deterministic variant.
The purpose of this work is to provide an analysis of the behaviour of the
multigrid algorithm in the presence of faults. Multigrid is arguably the method
of choice for the solution of large-scale linear algebra problems arising from
discretization of partial differential equations and it is of considerable
importance to anticipate its behaviour on an exascale machine. The analysis of
resilience of algorithms is in its infancy and the current work is perhaps the
first to provide a mathematical model for faults and analyse the behaviour of a
state-of-the-art algorithm under the model. It is shown that the Two Grid
Method fails to be resilient to faults. Attention is then turned to identifying
the minimal necessary remedial action required to restore the rate of
convergence to that enjoyed by the ideal fault-free method.
Comment: 27 pages and 6 figures
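To make the setting concrete, the following is a minimal sketch of a two-grid cycle in which the prolongated coarse-grid correction may be corrupted. It is not taken from either paper. The test problem (1D Poisson with zero Dirichlet boundary values), the damped Jacobi smoother, the fault model (each entry of the prolongated correction is independently set to zero with probability `q`), and all function names are assumptions chosen for illustration only.

```python
import random

def apply_A(u, h):
    """Apply tridiag(-1, 2, -1) / h^2 (1D Poisson, zero Dirichlet ends)."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h))
    return out

def residual(u, f, h):
    return [fi - ai for fi, ai in zip(f, apply_A(u, h))]

def jacobi(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Damped Jacobi smoothing; the diagonal of A is 2 / h^2."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)]
    return u

def restrict(r):
    """Full weighting; coarse point j sits at fine index 2j + 1."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(nc)]

def prolong(ec):
    """Linear interpolation from nc coarse points to 2 * nc + 1 fine points."""
    nc = len(ec)
    e = [0.0] * (2 * nc + 1)
    for j in range(nc):
        e[2 * j + 1] = ec[j]
    for j in range(nc + 1):
        left = ec[j - 1] if j > 0 else 0.0
        right = ec[j] if j < nc else 0.0
        e[2 * j] = 0.5 * (left + right)
    return e

def coarse_solve(rc, H):
    """Exact coarse solve of tridiag(-1, 2, -1) x = H^2 rc (Thomas algorithm)."""
    n = len(rc)
    d = [H * H * v for v in rc]
    c = [0.0] * n
    c[0] = -0.5
    d[0] = d[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (d[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def faulty(v, q, rng):
    """Assumed fault model: zero each entry independently with probability q."""
    return [0.0 if rng.random() < q else x for x in v]

def two_grid(u, f, h, q, rng):
    u = jacobi(u, f, h)                       # pre-smoothing
    ec = coarse_solve(restrict(residual(u, f, h)), 2.0 * h)
    e = faulty(prolong(ec), q, rng)           # unprotected prolongation
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h)                    # post-smoothing

def run(n=63, cycles=10, q=0.0, seed=0):
    """Return the max-norm residual reduction after the given number of cycles."""
    h = 1.0 / (n + 1)
    f = [1.0] * n
    u = [0.0] * n
    rng = random.Random(seed)
    r0 = max(abs(x) for x in residual(u, f, h))
    for _ in range(cycles):
        u = two_grid(u, f, h, q, rng)
    return max(abs(x) for x in residual(u, f, h)) / r0

if __name__ == "__main__":
    for q in (0.0, 0.01, 0.05):
        print("q =", q, "residual reduction:", run(q=q))
```

Calling `run` with `q=0.0` gives the fault-free cycle as a baseline; increasing `q` lets one compare residual reductions under the assumed fault model. Protecting the prolongation corresponds, in this sketch, to replacing `faulty(prolong(ec), q, rng)` with `prolong(ec)`.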