121,352 research outputs found

    A distributed networked approach for fault detection of large-scale systems

    Get PDF
    Networked systems present some key new challenges in the development of fault diagnosis architectures. This paper proposes a novel distributed networked fault detection methodology for large-scale interconnected systems. The proposed formulation incorporates a synchronization methodology with a filtering approach in order to reduce the effect of measurement noise and time delays on the fault detection performance. The proposed approach allows the monitoring of multi-rate systems, where asynchronous and delayed measurements are available. This is achieved through the development of a virtual sensor scheme with a model-based re-synchronization algorithm and a delay compensation strategy for distributed fault diagnostic units. The monitoring architecture exploits an adaptive approximator with learning capabilities for handling uncertainties in the interconnection dynamics. A consensus-based estimator with timevarying weights is introduced, for improving fault detectability in the case of variables shared among more than one subsystem. Furthermore, time-varying threshold functions are designed to prevent false-positive alarms. Analytical fault detectability sufficient conditions are derived and extensive simulation results are presented to illustrate the effectiveness of the distributed fault detection technique

    Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

    Full text link
    This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has being designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units run in multicore processors, cluster of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731

    Towards Distributed and Adaptive Detection and Localisation of Network Faults

    Get PDF
    We present a statistical probing-approach to distributed fault-detection in networked systems, based on autonomous configuration of algorithm parameters. Statistical modelling is used for detection and localisation of network faults. A detected fault is isolated to a node or link by collaborative fault-localisation. From local measurements obtained through probing between nodes, probe response delay and packet drop are modelled via parameter estimation for each link. Estimated model parameters are used for autonomous configuration of algorithm parameters, related to probe intervals and detection mechanisms. Expected fault-detection performance is formulated as a cost instead of specific parameter values, significantly reducing configuration efforts in a distributed system. The benefit offered by using our algorithm is fault-detection with increased certainty based on local measurements, compared to other methods not taking observed network conditions into account. We investigate the algorithm performance for varying user parameters and failure conditions. The simulation results indicate that more than 95 % of the generated faults can be detected with few false alarms. At least 80 % of the link faults and 65 % of the node faults are correctly localised. The performance can be improved by parameter adjustments and by using alternative paths for communication of algorithm control messages

    A Dual Digraph Approach for Leaderless Atomic Broadcast (Extended Version)

    Full text link
    Many distributed systems work on a common shared state; in such systems, distributed agreement is necessary for consistency. With an increasing number of servers, these systems become more susceptible to single-server failures, increasing the relevance of fault-tolerance. Atomic broadcast enables fault-tolerant distributed agreement, yet it is costly to solve. Most practical algorithms entail linear work per broadcast message. AllConcur -- a leaderless approach -- reduces the work, by connecting the servers via a sparse resilient overlay network; yet, this resiliency entails redundancy, limiting the reduction of work. In this paper, we propose AllConcur+, an atomic broadcast algorithm that lifts this limitation: During intervals with no failures, it achieves minimal work by using a redundancy-free overlay network. When failures do occur, it automatically recovers by switching to a resilient overlay network. In our performance evaluation of non-failure scenarios, AllConcur+ achieves comparable throughput to AllGather -- a non-fault-tolerant distributed agreement algorithm -- and outperforms AllConcur, LCR and Libpaxos both in terms of throughput and latency. Furthermore, our evaluation of failure scenarios shows that AllConcur+'s expected performance is robust with regard to occasional failures. Thus, for realistic use cases, leveraging redundancy-free distributed agreement during intervals with no failures improves performance significantly.Comment: Overview: 24 pages, 6 sections, 3 appendices, 8 figures, 3 tables. Modifications from previous version: extended the evaluation of AllConcur+ with a simulation of a multiple datacenters deploymen

    Hierarchical-Structure-Based Fault Estimation and Fault-Tolerant Control for Multiagent Systems

    Get PDF
    This paper proposes a hierarchical-structure-based fault estimation and fault-tolerant control design with bidirectional interactions for nonlinear multiagent systems with actuator faults. The hierarchical structure consists of distributed multiagent system hierarchy, undirected topology hierarchy, decentralized fault estimation hierarchy, and distributed fault-tolerant control hierarchy. The states and faults of the system are estimated simultaneously by merging the unknown input observer in a decentralized fashion. The distributed-constant-gain-based and node-based fault-tolerant control schemes are developed to guarantee the asymptotic stability and H-infinity performance of multiagent systems, respectively, based on the estimated information in the fault estimation hierarchy and the relative output information from neighbors. Two simulation cases validate the efficiency of the proposed hierarchical structure control algorithm

    Thread-level Parallelism in Fault Simulation of Deep Neural Networks on Multi-Processor Systems

    Get PDF
    High-performance fault simulation is one of the essential and preliminary tasks in the process of online and offline testing of machine learning (ML) hardware. Deep neural networks (DNN), as one of the essential parts of ML programs, are widely used in many critical and non-critical applications in Systems-on-Chip and ASIC designs. Through fault simulation for DNNs, by increasing the number of neurons, the fault simulation time increases exponentially. However, the software architecture of neural networks and the lack of dependency between neurons in each inference layer provide significant opportunity for parallelism of the fault simulation time in a multi-processor platform. In this paper, a multi-thread technique for hierarchical fault simulation of neural network is proposed, targeting both permanent and transient faults. During the process of fault simulation the neurons for each inference layer will be distributed among the executing threads. Since in the process of hierarchical fault simulation, the faulty neuron demands proportionally enormous computation comparing to behavioural model of non-faulty neurons, the faulty neuron will be assigned to one thread while the rest of the neurons will be divided among the remaining threads. Experimental results confirm the time efficiency of the proposed fault simulation technique on multi-processor architectures

    Fault-Tolerant Adaptive Parallel and Distributed Simulation

    Full text link
    Discrete Event Simulation is a widely used technique that is used to model and analyze complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributes Simulation techniques have been proposed to take advantage of multiple execution units which are found in multicore processors, cluster of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some protection against byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016

    A distributed fault-detection and diagnosis system using on-line parameter estimation

    Get PDF
    The development of a model-based fault-detection and diagnosis system (FDD) is reviewed. The system can be used as an integral part of an intelligent control system. It determines the faults of a system from comparison of the measurements of the system with a priori information represented by the model of the system. The method of modeling a complex system is described and a description of diagnosis models which include process faults is presented. There are three distinct classes of fault modes covered by the system performance model equation: actuator faults, sensor faults, and performance degradation. A system equation for a complete model that describes all three classes of faults is given. The strategy for detecting the fault and estimating the fault parameters using a distributed on-line parameter identification scheme is presented. A two-step approach is proposed. The first step is composed of a group of hypothesis testing modules, (HTM) in parallel processing to test each class of faults. The second step is the fault diagnosis module which checks all the information obtained from the HTM level, isolates the fault, and determines its magnitude. The proposed FDD system was demonstrated by applying it to detect actuator and sensor faults added to a simulation of the Space Shuttle Main Engine. The simulation results show that the proposed FDD system can adequately detect the faults and estimate their magnitudes

    Distributed Fault-Tolerant Consensus Tracking Control of Multi-Agent Systems under Fixed and Switching Topologies

    Get PDF
    This paper proposes a novel distributed fault-tolerant consensus tracking control design for multi-agent systems with abrupt and incipient actuator faults under fixed and switching topologies. The fault and state information of each individual agent is estimated by merging unknown input observer in the decentralized fault estimation hierarchy. Then, two kinds of distributed fault-tolerant consensus tracking control schemes with average dwelling time technique are developed to guarantee the mean-square exponential consensus convergence of multi-agent systems, respectively, on the basis of the relative neighboring output information as well as the estimated information in fault estimation. Simulation results demonstrate the effectiveness of the proposed fault-tolerant consensus tracking control algorithm
    • …
    corecore