Design Time Reliability Analysis of Distributed Fault Tolerance Algorithms


Designing a distributed fault tolerance algorithm requires careful analysis of both fault models and diagnosis strategies. A system will fail if there are too many active faults, especially active Byzantine faults. But, a system will also fail if overly aggressive convictions leave inadequate redundancy. For high reliability, an algorithm’s hybrid fault model and diagnosis strategy must be tuned to the types and rates of faults expected in the real world. We examine this balancing problem for two common types of distributed algorithms: clock synchronization and group membership. We show the importance of choosing a hybrid fault model appropriate for the physical faults expected by considering two clock synchronization algorithms. Three group membership service diagnosis strategies are used to demonstrate the benefit of discriminating between permanent and transient faults. In most cases, the probability of failure is dominated by one fault type. By identifying the dominant cause of failure, one can tailor an algorithm appropriately at design time, yielding significant reliability gain.
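The idea of a dominant fault type can be illustrated with a small calculation. The following is a minimal sketch, not the paper's model: it assumes nodes fail independently, treats each fault class separately, and uses hypothetical per-node fault probabilities and tolerance thresholds (`FAULT_CLASSES`, `N_NODES`) chosen only for illustration.

```python
from math import comb


def prob_too_many_faults(n, p_fault, max_tolerated):
    """Probability that more than max_tolerated of n nodes are faulty.

    Assumes each node is faulty independently with probability p_fault
    (a simplifying assumption, not taken from the paper).
    """
    return sum(
        comb(n, k) * p_fault**k * (1 - p_fault) ** (n - k)
        for k in range(max_tolerated + 1, n + 1)
    )


# Hypothetical system size and fault classes:
# name -> (per-node fault probability, number of such faults tolerated).
N_NODES = 7
FAULT_CLASSES = {
    "byzantine": (1e-4, 1),
    "benign": (1e-2, 2),
}


def failure_by_class(n, fault_classes):
    """Failure probability attributable to each fault class on its own."""
    return {
        name: prob_too_many_faults(n, p, tolerated)
        for name, (p, tolerated) in fault_classes.items()
    }


def dominant_class(n, fault_classes):
    """Name of the fault class with the largest failure probability."""
    per_class = failure_by_class(n, fault_classes)
    return max(per_class, key=per_class.get)


if __name__ == "__main__":
    for name, prob in failure_by_class(N_NODES, FAULT_CLASSES).items():
        print(f"{name}: {prob:.3e}")
    print("dominant:", dominant_class(N_NODES, FAULT_CLASSES))
```

With numbers like these, one class typically contributes most of the failure probability, which is the class a designer would target first. A fuller analysis would also need to account for faults of different classes being active at the same time.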


oaioai:CiteSeerX.psu: time updated on 10/28/2017

This paper was published in CiteSeerX.
