Synchronization and distributed agreement in real-time systems.

Abstract

A Byzantine fault is an arbitrary behavior on the part of a hardware component, a software module or a logical entity. The focus of this dissertation is on problems related to designing systems that are resilient to Byzantine faults. This work is important for computer systems that are designed to control critical real-time applications such as avionics, life-support systems, nuclear reactors, automobile engines and process control systems. The main emphasis of the dissertation is on developing techniques for synchronizing system components and ensuring distributed agreement in the presence of Byzantine faults. First, hardware and software schemes are proposed for synchronizing the local clocks in the nodes of a distributed system. These schemes compare favorably with existing schemes in terms of cost as well as tightness of synchronization. Next, a clock distribution scheme is proposed for delivering the clock signal to the components within a node without creating any timing uncertainties. The key aspect of this scheme is that both delay and skew are taken into account when determining the layout of the clock lines. The global time base that results from these two solutions can be used to simplify fault-tolerant algorithms for various design problems in real-time systems. This is illustrated by presenting a checkpointing and rollback recovery algorithm that requires considerably less time and space overhead than other algorithms. Finally, a solution is derived for diagnosing components with Byzantine faults. This is useful for reducing the number of faults that need to be tolerated to meet a specified reliability requirement, which in turn reduces the overhead imposed by any algorithm that is resilient to Byzantine faults. In short, the work accomplished in this dissertation demonstrates that it is not necessary to limit oneself to small distributed systems in order to achieve high reliability.
The techniques developed here are economical in small and in large distributed systems.

Ph.D.
Computer science
Electrical engineering
University of Michigan
http://deepblue.lib.umich.edu/bitstream/2027.42/162512/1/9013992.pd
