    Fault-Tolerant Adaptive Parallel and Distributed Simulation

    Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations, or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast number of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures, since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units. Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)

    Hierarchical-Structure-Based Fault Estimation and Fault-Tolerant Control for Multiagent Systems

    This paper proposes a hierarchical-structure-based fault estimation and fault-tolerant control design with bidirectional interactions for nonlinear multiagent systems with actuator faults. The hierarchical structure consists of a distributed multiagent system hierarchy, an undirected topology hierarchy, a decentralized fault estimation hierarchy, and a distributed fault-tolerant control hierarchy. The states and faults of the system are estimated simultaneously by merging the unknown input observer in a decentralized fashion. The distributed-constant-gain-based and node-based fault-tolerant control schemes are developed to guarantee the asymptotic stability and H-infinity performance of multiagent systems, respectively, based on the estimated information in the fault estimation hierarchy and the relative output information from neighbors. Two simulation cases validate the efficiency of the proposed hierarchical structure control algorithm.

    Distributed Fault-Tolerant Consensus Tracking Control of Multi-Agent Systems under Fixed and Switching Topologies

    This paper proposes a novel distributed fault-tolerant consensus tracking control design for multi-agent systems with abrupt and incipient actuator faults under fixed and switching topologies. The fault and state information of each individual agent is estimated by merging an unknown input observer in the decentralized fault estimation hierarchy. Then, two kinds of distributed fault-tolerant consensus tracking control schemes with the average dwell time technique are developed to guarantee the mean-square exponential consensus convergence of multi-agent systems, respectively, on the basis of the relative neighboring output information as well as the estimated information in fault estimation. Simulation results demonstrate the effectiveness of the proposed fault-tolerant consensus tracking control algorithm.

    Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

    This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of the multiple execution units available in multicore processors, clusters of workstations, or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units. Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
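
The idea of discarding corrupted copies of a replicated message can be illustrated with a simple majority filter. The sketch below is not FT-GAIA's implementation: the function name `accept_message`, the parameter `max_byzantine`, and the acceptance threshold of `max_byzantine + 1` matching copies are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def accept_message(copies: Sequence[Hashable], max_byzantine: int) -> Optional[Hashable]:
    """Pick the payload to deliver from the copies sent by an entity's replicas.

    Hypothetical rule (an assumption, not taken from the paper): a payload is
    accepted only if at least max_byzantine + 1 copies agree on it, so at
    least one of the agreeing copies must come from a correct replica.
    Returns None when no payload reaches that threshold.
    """
    if not copies:
        return None
    # Count identical payloads and take the most frequent one.
    payload, votes = Counter(copies).most_common(1)[0]
    return payload if votes >= max_byzantine + 1 else None


if __name__ == "__main__":
    # Three replicas, one of which delivers a corrupted payload.
    print(accept_message(["move(3,4)", "move(3,4)", "mov#(9,9)"], max_byzantine=1))
```

Under this assumed rule, tolerating `max_byzantine` corrupted senders needs at least `2 * max_byzantine + 1` replicas per entity, which is where the extra computational load mentioned in the abstract would come from.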

    A Dual Digraph Approach for Leaderless Atomic Broadcast (Extended Version)

    Many distributed systems work on a common shared state; in such systems, distributed agreement is necessary for consistency. With an increasing number of servers, these systems become more susceptible to single-server failures, increasing the relevance of fault tolerance. Atomic broadcast enables fault-tolerant distributed agreement, yet it is costly to solve. Most practical algorithms entail linear work per broadcast message. AllConcur -- a leaderless approach -- reduces the work by connecting the servers via a sparse resilient overlay network; yet, this resiliency entails redundancy, limiting the reduction of work. In this paper, we propose AllConcur+, an atomic broadcast algorithm that lifts this limitation: during intervals with no failures, it achieves minimal work by using a redundancy-free overlay network. When failures do occur, it automatically recovers by switching to a resilient overlay network. In our performance evaluation of non-failure scenarios, AllConcur+ achieves comparable throughput to AllGather -- a non-fault-tolerant distributed agreement algorithm -- and outperforms AllConcur, LCR and Libpaxos both in terms of throughput and latency. Furthermore, our evaluation of failure scenarios shows that AllConcur+'s expected performance is robust with regard to occasional failures. Thus, for realistic use cases, leveraging redundancy-free distributed agreement during intervals with no failures improves performance significantly. Comment: Overview: 24 pages, 6 sections, 3 appendices, 8 figures, 3 tables. Modifications from previous version: extended the evaluation of AllConcur+ with a simulation of a multiple-datacenter deployment

    Distributed Fault Estimation and Fault-Tolerant Control of Interconnected Systems

    This paper studies distributed fault estimation and fault-tolerant control for continuous-time interconnected systems. Using associated information among subsystems to design the distributed fault estimation observer can improve the accuracy of fault estimation for interconnected systems. Based on static output feedback, the global outputs of the interconnected systems are used to construct a distributed fault-tolerant controller. Multi-constrained methods are proposed to simultaneously enhance the transient performance and the ability to suppress external disturbances. The conditions of the presented design techniques are expressed in terms of linear matrix inequalities. Simulation results illustrate the feasibility of the presented approaches.

    New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

    Distributed-memory systems are key to achieving high-performance computing and are the most favored architectures for advanced research problems. Mesh-connected multicomputers are one of the most popular architectures and have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique, in which a packet is divided into small flits, has been widely used in the design of distributed-memory systems. Multicast communication, in which one source node sends the same message to several destination nodes, has also been widely used in distributed-memory systems. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. The development of fault-tolerant multicast routing algorithms for 2D mesh networks is an important issue. This dissertation presents new fault-tolerant multicast routing algorithms that improve the performance of distributed-memory systems using wormhole-routed 2D meshes. The algorithms are described for fault-tolerant routing in 2D mesh networks, but they can also be extended to other topologies. They combine a unicast-based multicast algorithm with tree-based multicast algorithms. They work effectively for the most commonly encountered faults in mesh networks: f-rings, f-chains, and concave fault regions. The proposed routing algorithms are shown to be effective even in the presence of a large number of fault regions and of large fault regions. The algorithms are proved to be deadlock-free, and the problem of overlapping fault regions is solved. Four essential performance metrics in mesh networks are considered and calculated. The algorithms use limited-global-information-based multicasting, which is a compromise between the local-information-based and global-information-based approaches. Data mining is used to validate the results and to enlarge the sample. The proposed multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms.

    Design and simulation of advanced fault tolerant flight control schemes

    This research effort describes the design and simulation of a distributed Neural Network (NN) based fault tolerant flight control scheme and the interface of the scheme within a simulation/visualization environment. The goal of the fault tolerant flight control scheme is to recover an aircraft from failures of its sensors or actuators. A commercially available simulation package, Aviator Visual Design Simulator (AVDS), was used for the purpose of simulation and visualization of the aircraft dynamics and the performance of the control schemes. For the purpose of the sensor failure detection, identification and accommodation (SFDIA) task, it is assumed that the pitch, roll and yaw rate gyros onboard are without physical redundancy. The task is accomplished through the use of a Main Neural Network (MNN) and a set of three De-Centralized Neural Networks (DNNs), providing analytical redundancy for the pitch, roll and yaw gyros. The purpose of the MNN is to detect a sensor failure while the purpose of the DNNs is to identify the failed sensor and then to provide failure accommodation. The actuator failure detection, identification and accommodation (AFDIA) scheme also features the MNN, for detection of actuator failures, along with three Neural Network Controllers (NNCs) for providing the compensating control surface deflections to neutralize the failure induced pitching, rolling and yawing moments. All NNs continue to train on-line, in addition to an offline trained baseline network structure, using the Extended Back-Propagation Algorithm (EBPA), with the flight data provided by the AVDS simulation package. The above mentioned adaptive flight control schemes have been traditionally implemented sequentially on a single computer. This research addresses the implementation of these fault tolerant flight control schemes on parallel and distributed computer architectures, using Berkeley Software Distribution (BSD) sockets and Message Passing Interface (MPI) for inter-process communication.

    A distributed scenario-based stochastic MPC for fault-tolerant microgrid energy management

    This paper proposes a fault-tolerant energy management algorithm for microgrid systems composed of several agents. The method stems from the need to design an algorithm that explicitly takes into account the possibility of faults and their consequences, in order to avoid excessively conservative solutions. A tree of possible fault scenarios is built in a completely distributed way by all the agents of the network; the resulting optimization problem is then solved through a distributed algorithm that not only requires little computational power from each agent but also keeps all local data and decision variables private. The effectiveness of the proposed method is demonstrated through simulation results.

    Wavelet analysis and consensus algorithm-based fault-tolerant control for smart grids

    In this paper, the voltage and frequency regulation problems are investigated for smart grids under the influence of faults. To solve these problems, a wavelet analysis and consensus algorithm-based fault-tolerant control scheme is proposed. Specifically, the wavelet analysis technique is introduced to determine whether faults exist in the smart grid. Then, a distributed fault estimator is designed to estimate the attack signals. Based on the state of this estimator, a distributed fault-tolerant controller is designed to compensate for the faults. It is theoretically shown that the developed method can achieve the voltage and frequency regulation objectives. Finally, a smart grid with four distributed generators is constructed in MATLAB/Simulink and simulated to validate the effectiveness of the method.