4 research outputs found

    Modeling Software Systems with Rejuvenation, Restoration and Checkpointing through Fluid Stochastic Petri Nets

    In this paper, we present a Fluid Stochastic Petri Net (FSPN) based model that captures the behavior of aging software systems with checkpointing, rejuvenation and self-restoration, three well-known software fault tolerance techniques. The proposed FSPN-based modeling framework is novel in several respects. First, the FSPN formalism itself, as proposed in [24], is extended by adding flush-out arcs. Second, the three techniques are captured simultaneously in a single model for the first time. Third, the formalism allows the dependencies of the three techniques on system features such as failure, load and time to be modeled in the same framework. Further, our base FSPN model can be viewed as a generalization of most previous models in the literature. To demonstrate this, we present a set of FSPNs that are simple modifications of the base model. These represent software systems with checkpointing only, with rejuvenation only, and with both checkpointing and rejuvenation. We show that these FSPNs ca..

    Near-optimal scheduling and decision-making models for reactive and proactive fault tolerance mechanisms

    As High Performance Computing (HPC) systems increase in size to meet demand for computational power, the chance of failure occurrences increases dramatically, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failure occurrences on running applications. However, the overhead of FT mechanisms increases proportionally to the HPC systems' size. Challenges therefore arise in handling the expensive overhead of FT mechanisms while minimizing the large amount of computing time lost to failure occurrences. In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint mechanism, by means of stochastic processes and calculus of variations. The obtained schedule minimizes the waste time caused by the checkpoint mechanism and failure occurrences. Generally, checkpoint/restart mechanisms periodically save application state and reload the saved state when a failure occurs. Furthermore, to handle various FT mechanisms, an adaptive decision-making model has been developed to determine the best FT strategy to invoke at each decision point. The best mechanism at each decision point is selected from the FT mechanisms under consideration so as to globally minimize the total waste time of an application execution, by means of a dynamic programming approach. In addition, the model adapts to changes in failure rate over time.