20,888 research outputs found

    The Parallel Persistent Memory Model

    Full text link
    We consider a parallel computational model that consists of PP processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded probability, and possibly restart. On faulting all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for developing locality efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly within the context of failures. The scheduler guarantees a time bound of O(W/PA+D(P/PA)⌈log⁡1/fW⌉)O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil) in expectation, where WW and DD are the work and depth of the computation (in the absence of failures), PAP_A is the average number of processors available during the computation, and f≤1/2f \le 1/2 is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same nam

    Intelligent fault management for the Space Station active thermal control system

    Get PDF
    The Thermal Advanced Automation Project (TAAP) approach and architecture is described for automating the Space Station Freedom (SSF) Active Thermal Control System (ATCS). The baseline functionally and advanced automation techniques for Fault Detection, Isolation, and Recovery (FDIR) will be compared and contrasted. Advanced automation techniques such as rule-based systems and model-based reasoning should be utilized to efficiently control, monitor, and diagnose this extremely complex physical system. TAAP is developing advanced FDIR software for use on the SSF thermal control system. The goal of TAAP is to join Knowledge-Based System (KBS) technology, using a combination of rules and model-based reasoning, with conventional monitoring and control software in order to maximize autonomy of the ATCS. TAAP's predecessor was NASA's Thermal Expert System (TEXSYS) project which was the first large real-time expert system to use both extensive rules and model-based reasoning to control and perform FDIR on a large, complex physical system. TEXSYS showed that a method is needed for safely and inexpensively testing all possible faults of the ATCS, particularly those potentially damaging to the hardware, in order to develop a fully capable FDIR system. TAAP therefore includes the development of a high-fidelity simulation of the thermal control system. The simulation provides realistic, dynamic ATCS behavior and fault insertion capability for software testing without hardware related risks or expense. In addition, thermal engineers will gain greater confidence in the KBS FDIR software than was possible prior to this kind of simulation testing. The TAAP KBS will initially be a ground-based extension of the baseline ATCS monitoring and control software and could be migrated on-board as additional computation resources are made available

    The Art of Fault Injection

    Get PDF
    Classical greek philosopher considered the foremost virtues to be temperance, justice, courage, and prudence. In this paper we relate these cardinal virtues to the correct methodological approaches that researchers should follow when setting up a fault injection experiment. With this work we try to understand where the "straightforward pathway" lies, in order to highlight those common methodological errors that deeply influence the coherency and the meaningfulness of fault injection experiments. Fault injection is like an art, where the success of the experiments depends on a very delicate balance between modeling, creativity, statistics, and patience

    Microprocessor fault-tolerance via on-the-fly partial reconfiguration

    Get PDF
    This paper presents a novel approach to exploit FPGA dynamic partial reconfiguration to improve the fault tolerance of complex microprocessor-based systems, with no need to statically reserve area to host redundant components. The proposed method not only improves the survivability of the system by allowing the online replacement of defective key parts of the processor, but also provides performance graceful degradation by executing in software the tasks that were executed in hardware before a fault and the subsequent reconfiguration happened. The advantage of the proposed approach is that thanks to a hardware hypervisor, the CPU is totally unaware of the reconfiguration happening in real-time, and there's no dependency on the CPU to perform it. As proof of concept a design using this idea has been developed, using the LEON3 open-source processor, synthesized on a Virtex 4 FPG

    An approach to rollback recovery of collaborating mobile agents

    Get PDF
    Fault-tolerance is one of the main problems that must be resolved to improve the adoption of the agents' computing paradigm. In this paper, we analyse the execution model of agent platforms and the significance of the faults affecting their constituent components on the reliable execution of agent-based applications, in order to develop a pragmatic framework for agent systems fault-tolerance. The developed framework deploys a communication-pairs independent check pointing strategy to offer a low-cost, application-transparent model for reliable agent- based computing that covers all possible faults that might invalidate reliable agent execution, migration and communication and maintains the exactly-one execution property

    Fault-tolerant formation driving mechanism designed for heterogeneous MAVs-UGVs groups

    Get PDF
    A fault-tolerant method for stabilization and navigation of 3D heterogeneous formations is proposed in this paper. The presented Model Predictive Control (MPC) based approach enables to deploy compact formations of closely cooperating autonomous aerial and ground robots in surveillance scenarios without the necessity of a precise external localization. Instead, the proposed method relies on a top-view visual relative localization provided by the micro aerial vehicles flying above the ground robots and on a simple yet stable visual based navigation using images from an onboard monocular camera. The MPC based schema together with a fault detection and recovery mechanism provide a robust solution applicable in complex environments with static and dynamic obstacles. The core of the proposed leader-follower based formation driving method consists in a representation of the entire 3D formation as a convex hull projected along a desired path that has to be followed by the group. Such an approach provides non-collision solution and respects requirements of the direct visibility between the team members. The uninterrupted visibility is crucial for the employed top-view localization and therefore for the stabilization of the group. The proposed formation driving method and the fault recovery mechanisms are verified by simulations and hardware experiments presented in the paper

    Fault Injection for Embedded Microprocessor-based Systems

    Get PDF
    Microprocessor-based embedded systems are increasingly used to control safety-critical systems (e.g., air and railway traffic control, nuclear plant control, aircraft and car control). In this case, fault tolerance mechanisms are introduced at the hardware and software level. Debugging and verifying the correct design and implementation of these mechanisms ask for effective environments, and Fault Injection represents a viable solution for their implementation. In this paper we present a Fault Injection environment, named FlexFI, suitable to assess the correctness of the design and implementation of the hardware and software mechanisms existing in embedded microprocessor-based systems, and to compute the fault coverage they provide. The paper describes and analyzes different solutions for implementing the most critical modules, which differ in terms of cost, speed, and intrusiveness in the original system behavio

    PI-based controller for low-power distributed inverters to maximise reactive current injection while avoiding over voltage during voltage sags

    Get PDF
    This paper is a postprint of a paper submitted to and accepted for publication in IET Power Electronics and is subject to Institution of Engineering and Technology Copyright. The copy of record is available at the IET Digital Library.In the recently deregulated power system scenario, the growing number of distributed generation sources should be considered as an opportunity to improve stability and power quality along the grid. To make progress in this direction, this work proposes a reactive current injection control scheme for distributed inverters under voltage sags. During the sag, the inverter injects, at least, the minimum amount of reactive current required by the grid code. The flexible reactive power injection ensures that one phase current is maintained at its maximum rated value, providing maximum support to the most faulted phase voltage. In addition, active power curtailment occurs only to satisfy the grid code reactive current requirements. As well as, a voltage control loop is implemented to avoid overvoltage in non-faulty phases, which otherwise would probably occur due to the injection of reactive current into an inductive grid. The controller is proposed for low-power rating distributed inverters where conventional voltage support provided by large power plants is not available. The implementation of the controller provides a low computational burden because conventional PI-based control loops may apply. Selected experimental results are reported in order to validate the effectiveness of the proposed control scheme.Peer ReviewedPostprint (updated version
    • …
    corecore