This thesis presents the results of an investigation into the failure detection problem. We consider the specific case of the Quality of Service (QoS) of crash failure detection. In contrast to previous work, we address the crash failure detection problem when the monitored target is resilient and recovers after failure. To the best of our knowledge, this is the first work to provide an analysis of crash-recovery failure detection from the QoS perspective.

We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one
which has the ability to recover from the crash state. We show that the fail-free run
and the crash-stop run are special cases of the crash-recovery run with mean time to
failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We extend the previously published QoS metrics to allow
the measurement of the recovery speed, and the definition of the completeness property
of a failure detector. We then analyze how the dependability of the crash-recovery
target affects the QoS bounds of such a crash-recovery failure detector, using general dependability metrics such as MTTF and MTTR and an approximate
probabilistic model of the two-process failure detection system. Based on this
approximate model and Chen et al.’s NFD-S algorithm, we show analytically how to estimate the failure detector’s parameters to
achieve a required QoS, and how
to execute the configuration procedure of this crash-recovery failure detector.

To make the failure detector adaptive to the target’s crash-recovery behavior and to enable autonomous monitoring, we propose two types of recovery detection protocols. The first is a reliable recovery detection protocol, which guarantees detection of every failure and recovery by using persistent storage. The second is a lightweight recovery detection protocol, which does not guarantee detection of every failure and recovery but reduces system overhead. Both protocols improve completeness without degrading the other QoS aspects of a failure detector. In addition, we demonstrate how to estimate the inputs, such as the dependability metrics, using the failure detector itself.

To evaluate our analytical work, we simulate the following failure detection algorithms: the simple heartbeat timeout algorithm, the NFD-S algorithm, and the NFD-S
algorithm with the lightweight recovery detection protocol, for various values of
MTTF and MTTR. The simulation results show that the dependability of a recoverable
monitored target can have a significant impact on the QoS of such a failure detector.
This agrees well with our models and analysis. We show that, for a reasonably long MTTF, the NFD-S algorithm with the lightweight recovery detection protocol exhibits better QoS than the NFD-S algorithm alone with respect to the completeness of a crash-recovery failure detector, and similarly for the other QoS metrics.