slides

Architecting fault-tolerant software systems

Abstract

The increasing size and complexity of software systems makes it hard to prevent or remove all possible faults. Faults that remain in the system can eventually lead to a system failure. Fault tolerance techniques are introduced for enabling systems to recover and continue operation when they are subject to faults. Many fault tolerance techniques are available but incorporating them in a system is not always trivial. We consider the following problems in designing a fault-tolerant system. First, existing reliability analysis techniques generally do not prioritize potential failures from the end-user perspective and accordingly do not identify sensitivity points of a system. \ud Second, existing architecture styles are not well-suited for specifying, communicating and analyzing design decisions that are particularly related to the fault-tolerant aspects of a system. Third, there are no adequate analysis techniques that evaluate the impact of fault tolerance techniques on the functional decomposition of software architecture. Fourth, realizing a fault-tolerant design usually requires a substantial development and maintenance effort. \ud To tackle the first problem, we propose a scenario-based software architecture reliability analysis method, called SARAH that benefits from mature reliability engineering techniques (i.e. FMEA, FTA) to provide an early reliability analysis of the software architecture design. SARAH evaluates potential failures from the end-user perspective to identify sensitive points of a system without requiring an implementation. \ud As a new architectural style, we introduce Recovery Style for specifying fault-tolerant aspects of software architecture. Recovery Style is used for communicating and analyzing architectural design decisions and for supporting detailed design with respect to recovery. \ud As a solution for the third problem, we propose a systematic method for optimizing the decomposition of software architecture for local recovery, which is an effective fault tolerance technique to attain high system availability. To support the method, we have developed an integrated set of tools that employ optimization techniques, state-based analytical models (i.e. CTMCs) and dynamic analysis on the system. The method enables the following: i ) modeling the design space of the possible decomposition alternatives, ii ) reducing the design space with respect to domain and stakeholder constraints and iii ) making the desired trade-off between availability and performance metrics. \ud To reduce the development and maintenance effort, we propose a framework, FLORA that supports the decomposition and implementation of software architecture for local recovery. The framework provides reusable abstractions for defining recoverable units and for incorporating the necessary coordination and communication protocols for recovery

    Similar works