17,442 research outputs found

    Fault-Tolerant Adaptive Parallel and Distributed Simulation

    Full text link
    Discrete Event Simulation is a widely used technique that is used to model and analyze complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributes Simulation techniques have been proposed to take advantage of multiple execution units which are found in multicore processors, cluster of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some protection against byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016

    Pinwheel Scheduling for Fault-tolerant Broadcast Disks in Real-time Database Systems

    Full text link
    The design of programs for broadcast disks which incorporate real-time and fault-tolerance requirements is considered. A generalized model for real-time fault-tolerant broadcast disks is defined. It is shown that designing programs for broadcast disks specified in this model is closely related to the scheduling of pinwheel task systems. Some new results in pinwheel scheduling theory are derived, which facilitate the efficient generation of real-time fault-tolerant broadcast disk programs.National Science Foundation (CCR-9308344, CCR-9596282

    Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

    Full text link
    This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has being designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units run in multicore processors, cluster of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731

    Survivable algorithms and redundancy management in NASA's distributed computing systems

    Get PDF
    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given

    An optimal fixed-priority assignment algorithm for supporting fault-tolerant hard real-time systems

    Get PDF
    The main contribution of this paper is twofold. First, we present an appropriate schedulability analysis, based on response time analysis, for supporting fault-tolerant hard real-time systems. We consider systems that make use of error-recovery techniques to carry out fault tolerance. Second, we propose a new priority assignment algorithm which can be used, together with the schedulability analysis, to improve system fault resilience. These achievements come from the observation that traditional priority assignment policies may no longer be appropriate when faults are being considered. The proposed schedulability analysis takes into account the fact that the recoveries of tasks may be executed at higher priority levels. This characteristic is very important since, after an error, a task certainly has a shorter period of time to meet its deadline. The proposed priority assignment algorithm, which uses some properties of the analysis, is very efficient. We show that the method used to find out an appropriate priority assignment reduces the search space from O(n!) to O(n/sup 2/), where n is the number of task recovery procedures. Also, we show that the priority assignment algorithm is optimal in the sense that the fault resilience of task sets is maximized as for the proposed analysis. The effectiveness of the proposed approach is evaluated by simulation

    Smart Power Grid Synchronization With Fault Tolerant Nonlinear Estimation

    Get PDF
    Effective real-time state estimation is essential for smart grid synchronization, as electricity demand continues to grow, and renewable energy resources increase their penetration into the grid. In order to provide a more reliable state estimation technique to address the problem of bad data in the PMU-based power synchronization, this paper presents a novel nonlinear estimation framework to dynamically track frequency, voltage magnitudes and phase angles. Instead of directly analyzing in abc coordinate frame, symmetrical component transformation is employed to separate the positive, negative, and zero sequence networks. Then, Clarke\u27s transformation is used to transform the sequence networks into the αβ stationary coordinate frame, which leads to system model formulation. A novel fault tolerant extended Kalman filter based real-time estimation framework is proposed for smart grid synchronization with noisy bad data measurements. Computer simulation studies have demonstrated that the proposed fault tolerant extended Kalman filter (FTEKF) provides more accurate voltage synchronization results than the extended Kalman filter (EKF). The proposed approach has been implemented with dSPACE DS1103 and National Instruments CompactRIO hardware platforms. Computer simulation and hardware instrumentation results have shown the potential applications of FTEKF in smart grid synchronization
    corecore