Characterization of Failure Handling in Fault-Tolerant Multiprocessor Systems.

Abstract

Traditional reliability-related models for fault-tolerant systems are used to predict system reliability, availability, computation capacity, or performability. They lack the capacity to treat in detail the handling and the consequences of failure, and they pay insufficient attention to the fact that a system crash could follow any mishandling of a failure. Failure handling consists of three major steps: error detection, system reconfiguration, and computation recovery. These steps must be considered together as a single package, not as separate entities as in the traditional analyses. Such an integration can be extended to develop design aids for fault-tolerant computers. The dissertation begins with the modeling of fault/error detection mechanisms, which are designed to identify faulty units. When fault latency and/or error latency exist, the system may suffer from the propagation of errors and the accumulation of extant faults, which will seriously reduce its fault-tolerant capability. Several detection models are developed so that the effect of detection mechanisms on the subsequent error handling and on overall system reliability can be studied. Upon detection of a faulty unit, the system should reconfigure itself into an optimal configuration so that the total reward to be achieved from the subsequent executions may be maximized. Finally, the contaminated processes have to be recovered. The error recovery strategies employed will depend on the detection mechanisms and the redundancy available. Several recovery methods, especially retry and rollback, are analyzed. The recovery overheads are evaluated, providing an index of the capabilities of the detection and reconfiguration mechanisms.

Ph.D.
Computer science
University of Michigan
http://deepblue.lib.umich.edu/bitstream/2027.42/160550/1/8512452.pd
