

By Israel Hsu, Andrew Gallagher, Michael Le and Yuval Tamir


Asynchronous events and complex system state distributed across independent nodes make exposure and diagnosis of flaws in distributed systems a challenge. The difficulties are exacerbated when the goal is to validate fault tolerance mechanisms that are activated only by the occurrence of errors, which are, by nature, rare. Validation of fault tolerance mechanisms is often done by injecting faults that emulate the actual faults and “stress” the functionality of the resilience mechanisms. Validation campaigns lasting days and involving thousands of fault injections are often necessary. We present an infrastructure that combines virtualization and software-implemented fault injection to automate validation campaigns and support the analysis of the behavior of a distributed system under test. Virtualization enables: 1) a flexible fault injector capable of emulating a wide variety of faults, and 2) a mechanism for autonomously recovering faulty nodes so that the campaign can continue running on a target system that is fully functional. As a case study, we use this infrastructure to validate a Byzantine-fault-tolerant cluster manager. Over 1280 hours of fault injections yielded the exposure of 11 unique flaws in the cluster manager.

Topics: Fault Injection, Dependability, Validation Tools
Year: 2010
OAI identifier: oai:CiteSeerX.psu:
Provided by: CiteSeerX
Full text may be available at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://www.cs.ucla.edu/%7Etami... (external link)