181,286 research outputs found

    Fault Tolerance for Real-Time Systems: Analysis and Optimization of Roll-back Recovery with Checkpointing

    Get PDF
    Increasing soft error rates in recent semiconductor technologies enforce the usage of fault tolerance. While fault tolerance enables correct operation in the presence of soft errors, it usually introduces a time overhead. The time overhead is particularly important for a group of computer systems referred to as real-time systems (RTSs) where correct operation is defined as producing the correct result of a computation while satisfying given time constraints (deadlines). Depending on the consequences when the deadlines are violated, RTSs are classified into soft and hard RTSs. While violating deadlines in soft RTSs usually results in some performance degradation, violating deadlines in hard RTSs results in catastrophic consequences. To determine if deadlines are met, RTSs are analyzed with respect to average execution time (AET) and worst case execution time (WCET), where AET is used for soft RTSs, and WCET is used for hard RTSs. When fault tolerance is employed in both soft and hard RTSs, the time overhead caused due to usage of fault tolerance may be the reason that deadlines in RTSs are violated. Therefore, there is a need to optimize the usage of fault tolerance in RTSs. To enable correct operation of RTSs in the presence of soft errors, in this thesis we consider a fault tolerance technique, Roll-back Recovery with Checkpointing (RRC), that efficiently copes with soft errors. The major drawback of RRC is that it introduces a time overhead which depends on the number of checkpoints that are used in RRC. Depending on how the checkpoints are distributed throughout the execution of the job, we consider the two checkpointing schemes: equidistant checkpointing, where the checkpoints are evenly distributed, and non-equidistant checkpointing, where the checkpoints are not evenly distributed. The goal of this thesis is to provide an optimization framework for RRC when used in RTSs while considering different optimization objectives which are important for RTSs. The purpose of such an optimization framework is to assist the designer of an RTS during the early design stage, when the designer needs to explore different fault tolerance techniques, and choose a particular fault tolerance technique that meets the specification requirements for the RTS that is to be implemented. By using the optimization framework presented in this thesis, the designer of an RTS can acquire knowledge if RRC is a suitable fault tolerance technique for the RTS which needs to be implemented. The proposed optimization framework includes the following optimization objectives. For soft RTSs, we consider optimization of RRC with respect to AET. For the case of equidistant checkpointing, the optimization framework provides the optimal number of checkpoints resulting in the minimal AET. For non-equidistant checkpointing, the optimization framework provides two adaptive techniques that estimate the probability of errors and adjust the checkpointing scheme (the number of checkpoints over time) with the goal to minimize the AET. While for soft RTSs analyses based on AET are sufficient, for hard RTSs it is more important to maximize the probability that deadlines are met. To evaluate to what extent a deadline is met, in this thesis we have used the statistical concept Level of Confidence (LoC). The LoC with respect to a given deadline defines the probability that a job (or a set of jobs) completes before the given deadline. As a metric, LoC is equally applicable for soft and hard RTSs. However, as an optimization objective LoC is used in hard RTSs. Therefore, for hard RTSs, we consider optimization of RRC with respect to LoC. For equidistant checkpointing, the optimization framework provides (1) for a single job, the optimal number of checkpoints resulting in the maximal LoC with respect to a given deadline, and (2) for a set of jobs running in a sequence and a global deadline, the optimization framework provides the number of checkpoints that should be assigned to each job such that the LoC with respect to the global deadline is maximized. For non-equidistant checkpointing, the optimization framework provides how a given number of checkpoints should be distributed such that the LoC with respect to a given deadline is maximized. Since the specification of an RTS may have a reliability requirement such that all deadlines need to be met with some probability, in this thesis we have introduced the concept Guaranteed Completion Time which refers to a completion time such that the probability that a job completes within this time is at least equal to a given reliability requirement. The optimization framework includes Guaranteed Completion Time as an optimization objective, and with respect to the Guaranteed Completion Time, the framework provides the optimal number of checkpoints, while assuming equidistant checkpointing, that results in the minimal Guaranteed Completion Time

    Addressing performance requirements in the FDT-based design of distributed systems

    Get PDF
    The development of distributed systems is generally regarded as a complex and costly task, and for this reason formal description techniques such as LOTOS and ESTELLE (both standardized by the ISO) are increasingly used in this process. Our experience is that LOTOS can be exploited at many stages on the design trajectory, from requirements specification to implementation, but that the language elements do not allow direct formalization of performance requirements. To avoid duplication of effort by using two formalisms with distinct approaches, we propose a design method that incorporates performance constraints in an heuristic but effective manner

    Intelligent agent for formal modelling of temporal multi-agent systems

    Get PDF
    Software systems are becoming complex and dynamic with the passage of time, and to provide better fault tolerance and resource management they need to have the ability of self-adaptation. Multi-agent systems paradigm is an active area of research for modeling real-time systems. In this research, we have proposed a new agent named SA-ARTIS-agent, which is designed to work in hard real-time temporal constraints with the ability of self-adaptation. This agent can be used for the formal modeling of any self-adaptive real-time multi-agent system. Our agent integrates the MAPE-K feedback loop with ARTIS agent for the provision of self-adaptation. For an unambiguous description, we formally specify our SA-ARTIS-agent using Time-Communicating Object-Z (TCOZ) language. The objective of this research is to provide an intelligent agent with self-adaptive abilities for the execution of tasks with temporal constraints. Previous works in this domain have used Z language which is not expressive to model the distributed communication process of agents. The novelty of our work is that we specified the non-terminating behavior of agents using active class concept of TCOZ and expressed the distributed communication among agents. For communication between active entities, channel communication mechanism of TCOZ is utilized. We demonstrate the effectiveness of the proposed agent using a real-time case study of traffic monitoring system

    Hard Real-Time Networking on Firewire

    Get PDF
    This paper investigates the possibility of using standard, low-cost, widely used FireWire as a new generation fieldbus medium for real-time distributed control applications. A real-time software subsystem, RT-FireWire was designed that can, in combination with Linux-based real-time operating system, provide hard real-time communication over FireWire. In addition, a high-level module that can emulate Ethernet over RT-FireWire was implemented. This additional module enables existing IP-based real-time communication frameworks to work on top of FireWire. The real-time behavior of RT-FireWire was demonstrated with a simple control setup. Furthermore, an outlook of the future development on RT-FireWire is given
    corecore