
    Fault-Tolerant Adaptive Parallel and Distributed Simulation

    Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations, and HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast number of ancillary components. Despite improvements in manufacturing processes, component failures are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them over multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures, since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved at the cost of a moderate increase in the computational load of the execution units. Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)
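
    The abstract describes message replication only at a high level; the short Python sketch below is a hypothetical illustration of the underlying idea (not FT-GAIA's actual implementation), assuming each logical entity is replicated and a receiver accepts a synchronization message only once a majority of replica copies agree, so corrupted copies are discarded.

        from collections import Counter

        class ReplicatedReceiver:
            """Hypothetical sketch: collect copies of a synchronization message
            sent by the replicas of one logical sender and accept the payload
            only when a majority of the copies agree, masking corrupted ones."""

            def __init__(self, num_replicas):
                self.num_replicas = num_replicas
                self.pending = {}  # message id -> list of received payloads

            def deliver(self, msg_id, payload):
                """Buffer one copy; return the accepted payload once a majority
                of identical copies has arrived, otherwise None."""
                copies = self.pending.setdefault(msg_id, [])
                copies.append(payload)
                value, count = Counter(copies).most_common(1)[0]
                if count > self.num_replicas // 2:
                    return value  # majority agreement: corrupted copies are discarded
                return None

        # Example: 3 replicas, one corrupted copy is outvoted.
        rx = ReplicatedReceiver(num_replicas=3)
        print(rx.deliver("evt-42", "advance(t=10)"))   # None (only 1 copy so far)
        print(rx.deliver("evt-42", "advance(t=99)"))   # None (corrupted copy)
        print(rx.deliver("evt-42", "advance(t=10)"))   # "advance(t=10)" (majority reached)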

    Case study: Bio-inspired self-adaptive strategy for spike-based PID controller

    A key requirement for modern large-scale neuromorphic systems is the ability to detect and diagnose faults and to explore self-correction strategies, and in particular to do so under the area constraints that the scalability of large neuromorphic systems demands. A bio-inspired online fault detection and self-correction mechanism for neuro-inspired PID controllers is presented in this paper. This strategy employs a fault detection unit for online testing of the PID controller, a fault detection manager to run the detection procedure across multiple controllers, and a controller selection mechanism that selects an available fault-free controller to provide a corrective step and restore system functionality. The novelty of the proposed work is that the fault detection method, using synapse models with excitatory and inhibitory responses, is applied to a robotic spike-based PID controller. Results are presented for robotic motor controllers and show that the proposed bio-inspired self-detection and self-correction strategy can detect faults and re-allocate resources to restore the controller’s functionality. In particular, the case study demonstrates the compactness (~1.4% area overhead) of the fault detection mechanism for large-scale robotic controllers. Ministerio de Economía y Competitividad TEC2012-37868-C04-0
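
    The abstract outlines the detect-then-reallocate architecture without implementation detail; the Python sketch below is a hypothetical illustration of that idea (the classes, the conventional non-spiking PID, and the failure model are assumptions, not the paper's design): a manager runs an online test over a bank of controllers and routes the control loop to the first fault-free one.

        class PID:
            """Minimal discrete PID controller (illustrative, not the spiking version)."""
            def __init__(self, kp, ki, kd, dt=0.01):
                self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
                self.integral = 0.0
                self.prev_error = 0.0

            def update(self, setpoint, measurement):
                error = setpoint - measurement
                self.integral += error * self.dt
                derivative = (error - self.prev_error) / self.dt
                self.prev_error = error
                return self.kp * error + self.ki * self.integral + self.kd * derivative

        class ControllerBank:
            """Hypothetical sketch of fault detection plus controller selection:
            an online test flags faulty controllers and the selector switches
            the control loop to the first controller reported fault-free."""

            def __init__(self, controllers, online_test):
                self.controllers = controllers      # redundant PID controllers
                self.online_test = online_test      # callable: controller -> True if fault-free
                self.active = 0                     # index of the controller in use

            def select_fault_free(self):
                """Fault detection manager: test all controllers, pick a fault-free one."""
                for idx, ctrl in enumerate(self.controllers):
                    if self.online_test(ctrl):
                        self.active = idx
                        return idx
                raise RuntimeError("no fault-free controller available")

            def step(self, setpoint, measurement):
                """One control step; re-select a controller if the active one fails its test."""
                if not self.online_test(self.controllers[self.active]):
                    self.select_fault_free()
                return self.controllers[self.active].update(setpoint, measurement)

        # Example: three redundant controllers; pretend controller 0 has developed a fault.
        failed = {0}
        controllers = [PID(1.0, 0.1, 0.01) for _ in range(3)]
        bank = ControllerBank(controllers,
                              online_test=lambda c: controllers.index(c) not in failed)
        print(bank.step(setpoint=1.0, measurement=0.2))  # served by a fault-free spare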

    Multilevel Clustering Fault Model for IC Manufacture

    A hierarchical approach to the construction of compound distributions for process-induced faults in IC manufacture is proposed. Within this framework, the negative binomial distribution is treated as the level-1 model. The hierarchical approach to fault distribution offers an integrated picture of how fault density varies from region to region within a wafer, from wafer to wafer within a batch, and so on. A theory of compound-distribution hierarchies is developed by means of generating functions. A study of correlations, which arise naturally in microelectronics due to the batch character of IC manufacture, is also proposed. Taking these correlations into account is of significant importance for developing statistical quality control procedures in IC manufacture. With respect to applications, hierarchies of yield means and yield probability-density functions are considered. Comment: 10 pages, the International Conference "Micro- and Nanoelectronics-2003" (ICMNE-2003), Zvenigorod, Moscow district, Russia, October 6-10, 2003
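
    The abstract names the mathematical machinery without reproducing it; for orientation, the relations below are standard textbook results (not formulas quoted from the paper) showing how compounding works through probability generating functions and how the negative binomial level-1 model gives the familiar clustered-yield expression, where A is the chip area, D the fault density and \alpha the clustering parameter.

        % Compounding via probability generating functions: for a random sum
        % S = X_1 + ... + X_N, the pgf of S is the composition
        G_S(z) = G_N\bigl(G_X(z)\bigr)

        % Negative binomial level-1 model with mean fault count \lambda = A D
        % and clustering parameter \alpha; the yield is the zero-fault probability:
        P(k) = \frac{\Gamma(\alpha + k)}{k!\,\Gamma(\alpha)}
               \left(\frac{\lambda}{\alpha}\right)^{k}
               \left(1 + \frac{\lambda}{\alpha}\right)^{-(\alpha + k)},
        \qquad
        Y = P(0) = \left(1 + \frac{A D}{\alpha}\right)^{-\alpha}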

    DeSyRe: on-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chip (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect- and fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.
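
    The abstract gives only the architectural idea of a small fault-free part managing the fault-prone resources; the Python sketch below is a hypothetical illustration of such a management loop (the names, the health map, and the load-balancing policy are assumptions, not DeSyRe's design): a trusted manager tracks which fault-prone cores are healthy and remaps tasks away from cores that fail.

        class FaultFreeManager:
            """Hypothetical sketch: a small trusted (fault-free) manager keeps a
            health map of the fault-prone cores and remaps tasks on failure."""

            def __init__(self, num_cores):
                self.healthy = set(range(num_cores))   # cores currently usable
                self.placement = {}                    # task -> core

            def report_fault(self, core):
                """Called when the manager's checkers flag a core as faulty."""
                self.healthy.discard(core)
                for task, c in list(self.placement.items()):
                    if c == core:
                        self.place(task)               # migrate affected tasks

            def place(self, task):
                """Assign a task to the least-loaded healthy core."""
                if not self.healthy:
                    raise RuntimeError("no healthy cores left")
                load = {c: 0 for c in self.healthy}
                for c in self.placement.values():
                    if c in load:
                        load[c] += 1
                core = min(load, key=load.get)
                self.placement[task] = core
                return core

        # Example: place three tasks on four cores, then lose core 0.
        mgr = FaultFreeManager(num_cores=4)
        for t in ("filter", "classify", "log"):
            mgr.place(t)
        mgr.report_fault(0)        # tasks on core 0 are migrated to healthy cores
        print(mgr.placement)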

    Improving Aircraft Engines Prognostics and Health Management via Anticipated Model-Based Validation of Health Indicators

    The aircraft engine manufacturing industry is subject to many dependability constraints imposed by certification authorities and by its economic context. In particular, the costs induced by unscheduled maintenance, delays, and cancellations make it necessary to ensure a minimum level of availability. For this purpose, Prognostics and Health Management (PHM) is used to perform online periodic assessment of the engines’ health status. The whole PHM methodology is based on the processing of variables that reflect the system’s health status, called Health Indicators (HI). Collecting the HI is an on-board embedded task which has to be specified before entry into service, for reasons of retrofit cost. However, the development of PHM systems is currently treated as a marginal task in the industry, and most of the time the set of HI is defined too late and only in a qualitative way. In this paper, the authors propose a novel development methodology for PHM systems centered on an anticipated model-based validation of the HI. This validation relies on uncertainty propagation to simulate the distributions of the HI, taking the randomness of the parameters into account. The paper also defines performance metrics and criteria for the validation of the HI set. Finally, the methodology is applied to the development of a PHM solution for an aircraft engine actuation loop. It reveals a lack of performance of the original set of HI and allows new indicators to be defined so that the specifications are met before entry into service.
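
    The abstract mentions uncertainty propagation for simulating HI distributions without specifying the method; a common way to do this is Monte Carlo sampling of the uncertain model parameters, sketched below in Python (the health-indicator model, parameter values, and spreads are purely illustrative assumptions, not the paper's).

        import random
        import statistics

        def health_indicator(gain, friction, load):
            """Illustrative stand-in for a model-based health indicator of an
            actuation loop (not the paper's model)."""
            return (1.0 + friction) * load / gain

        def propagate_uncertainty(n_samples=10_000, seed=0):
            """Monte Carlo uncertainty propagation: sample the uncertain parameters,
            evaluate the HI model, and summarize the resulting HI distribution."""
            rng = random.Random(seed)
            samples = []
            for _ in range(n_samples):
                gain = rng.gauss(2.0, 0.1)         # assumed nominal value and spread
                friction = rng.uniform(0.05, 0.15)
                load = rng.gauss(1.0, 0.05)
                samples.append(health_indicator(gain, friction, load))
            samples.sort()
            return {
                "mean": statistics.mean(samples),
                "std": statistics.stdev(samples),
                "p05": samples[int(0.05 * n_samples)],
                "p95": samples[int(0.95 * n_samples)],
            }

        # A validation criterion could then check, for example, that the simulated
        # 5th-95th percentile band of a healthy-engine HI stays clear of an alarm threshold.
        print(propagate_uncertainty())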

    NASA space station automation: AI-based technology review

    Research and Development projects in automation for the Space Station are discussed. Artificial Intelligence (AI) based automation technologies are planned to enhance crew safety through a reduced need for EVA, increase crew productivity through the reduction of routine operations, increase space station autonomy, and augment space station capability through the use of teleoperation and robotics. AI technology will also be developed for the servicing of satellites at the Space Station, system monitoring and diagnosis, space manufacturing, and the assembly of large space structures.