With the increasing number of compute components, failures in future
exa-scale computer systems are expected to become more frequent. This motivates
the study of novel resilience techniques. Here, we extend a recently proposed
algorithm-based recovery method for multigrid iterations by introducing an
adaptive control. After a fault, the healthy part of the system continues the
iterative solution process, while the solution in the faulty domain is
re-constructed by an asynchronous on-line recovery. The computations in both
the faulty and healthy subdomains must be coordinated in a sensitive way, in
particular, both under and over-solving must be avoided. Both of these waste
computational resources and will therefore increase the overall
time-to-solution. To control the local recovery and guarantee an optimal
re-coupling, we introduce a stopping criterion based on a mathematical error
estimator. It involves hierarchical weighted sums of residuals within the
context of uniformly refined meshes and is well-suited in the context of
parallel high-performance computing. The re-coupling process is steered by
local contributions of the error estimator. We propose and compare two criteria
which differ in their weights. Failure scenarios when solving up to
6.9⋅1011 unknowns on more than 245\,766 parallel processes will be
reported on a state-of-the-art peta-scale supercomputer demonstrating the
robustness of the method