General solutions for MTTF and steady-state availability of NMR systems
Voting redundancy is a well-known technique for improving the fault tolerance of digital systems, and the calculation of reliability and availability is a necessary part of every fault-tolerant system design. Conventional techniques, including the block diagram method, fail when the system involves dependent failures, repair, or standby operation. Markov models also become largely impractical when the system is composed of many elements or is repairable. However, the respective alternatives to reliability and availability, namely mean time to failure and steady-state availability, are easily obtained from Markov graphs. In this paper, novel general-form solutions for calculating the mean time to failure and steady-state availability of N-modular redundancy (NMR) systems are presented.
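The Markov-graph route to MTTF mentioned above can be illustrated on the smallest NMR instance, triple modular redundancy. The sketch below is only illustrative and is not the paper's general NMR solution; it computes the mean time to absorption of a two-transient-state Markov chain for a non-repairable TMR system.

```python
# MTTF of a TMR (3-modular redundancy) system via mean time to absorption
# of a small continuous-time Markov chain. Illustrative sketch only.

def tmr_mttf(lam):
    """MTTF of a non-repairable TMR system with per-module failure rate lam.

    Transient states: 3 modules up (exit rate 3*lam) and 2 modules up
    (exit rate 2*lam, after which the voter can no longer mask the fault).
    Mean times to absorption solve Q t = -1 over the transient states;
    here Q is upper-triangular, so back-substitution suffices.
    """
    t2 = 1.0 / (2 * lam)       # expected remaining time once one module failed
    t3 = 1.0 / (3 * lam) + t2  # sojourn in the all-up state, then the rest
    return t3                  # closed form: 5 / (6 * lam)

def simplex_mttf(lam):
    """MTTF of a single (simplex) module, for comparison."""
    return 1.0 / lam
```

Note the classic observation this reproduces: TMR improves short-mission reliability, yet its MTTF (5/(6λ)) is lower than that of a single module (1/λ).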
Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks
Technology scaling, accompanied by higher operating frequencies and the ability to integrate more functionality into the same chip, has been the driving force behind delivering higher-performance computing systems at lower cost. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as a Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. First, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Second, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running on network-on-chip-based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote-memory-access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping, and (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at the system level, in particular by exploiting the redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for adoption at design time) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on lifetime reliability.
We propose two recovery schemes, based on a checkpoint-and-rollback and a roll-forward technique. For the latter problem, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that this can be done without incurring large overheads. In addressing these problems, we present techniques that have been realized (depending on their characteristics) in the form of a design tool, a run-time library, or a hardware core to be added to the basic architecture.
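The checkpoint-and-rollback idea can be sketched in a few lines. This is a minimal, software-level illustration under assumed names (`Checkpointer`, `TransientFault`, `run_with_rollback`); the thesis's actual recovery schemes for Kahn process networks operating on NoC platforms are considerably more involved.

```python
# Minimal checkpoint-and-rollback sketch: save state before each step,
# restore it and retry when a transient fault is detected.

import copy

class TransientFault(Exception):
    """Signals a detected transient fault during a processing step."""

class Checkpointer:
    def __init__(self, state):
        self._snapshot = copy.deepcopy(state)

    def checkpoint(self, state):
        self._snapshot = copy.deepcopy(state)

    def rollback(self):
        return copy.deepcopy(self._snapshot)

def run_with_rollback(step, state, inputs):
    """Process `inputs` with `step(state, item)`; on a fault, roll back
    to the last checkpoint and retry the failed item once."""
    ckpt = Checkpointer(state)
    for item in inputs:
        ckpt.checkpoint(state)        # save known-good state
        try:
            state = step(state, item)
        except TransientFault:
            state = ckpt.rollback()   # discard corrupted partial state
            state = step(state, item) # retry after recovery
    return state
```

A roll-forward scheme, by contrast, would skip the failed computation and reconstruct or approximate its result from redundant information rather than re-executing it.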
Addressing Complexity and Intelligence in Systems Dependability Evaluation
Engineering and computing systems are increasingly complex, intelligent, and open-adaptive. When it comes to the dependability evaluation of such systems, certain challenges are posed by the characteristics of "complexity" and "intelligence". The first aspect of complexity is the dependability modelling of large systems with many interconnected components and dynamic behaviours such as Priority, Sequencing, and Repairs. To address this, the thesis proposes a novel hierarchical solution to dynamic fault tree analysis using Semi-Markov Processes. A second aspect of complexity is the environmental conditions that may impact dependability, and their modelling. For instance, weather and logistics can influence maintenance actions and hence the dependability of an offshore wind farm. The thesis proposes a semi-Markov-based maintenance model called the "Butterfly Maintenance Model (BMM)" to model this complexity and accommodate it in dependability evaluation. A third aspect of complexity is the open nature of systems of systems, such as swarms of drones, which makes complete design-time dependability analysis infeasible. To address this aspect, the thesis proposes a dynamic dependability evaluation method using fault trees and Markov models at runtime. The challenge of "intelligence" arises because Machine Learning (ML) components do not exhibit programmed behaviour; their behaviour is learned from data. In traditional dependability analysis, however, systems are assumed to be programmed or designed. When a system has learned from data, a distributional shift of operational data away from the training data may cause the ML component to behave incorrectly, e.g., to misclassify objects. To address this, a new approach called SafeML is developed that uses statistical distance measures to monitor the performance of ML against such distributional shifts.
The thesis develops the proposed models and evaluates them on case studies, highlighting improvements over the state of the art, limitations, and future work.
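The SafeML idea of comparing operational data against training data with a statistical distance can be sketched with a two-sample Kolmogorov-Smirnov statistic. The function names and the alarm threshold below are illustrative assumptions, not values taken from the thesis.

```python
# Distribution-shift monitor in the spirit of SafeML: measure the
# distance between training-time and operational samples of a feature
# and raise an alarm when it exceeds a threshold.

import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_s, x):
        # fraction of the sorted sample that is <= x
        return bisect.bisect_right(sorted_s, x) / len(sorted_s)

    values = sorted(set(a) | set(b))
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)

def shift_alarm(train, operational, threshold=0.2):
    """True if the operational data drifted away from the training data.
    The 0.2 threshold is an arbitrary illustrative choice."""
    return ks_statistic(train, operational) > threshold
```

In a deployed monitor the threshold would be calibrated (e.g., from the KS distribution or by bootstrapping on held-out training data) rather than fixed by hand.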
Dependability analysis of a safety critical system: the LHC Beam Dumping System at CERN
The beam extraction system of CERN's new LHC accelerator (the LHC Beam Dumping System, LBDS) has the task of removing the particle beam from the ring in the event of anomalies or machine failures, or at the end of an operation. The system is one of the safety-critical components of the LHC accelerator. Its malfunction can lead to a missed or partial extraction of the beam which, given the extremely high energies reached (7 TeV), is capable of destroying the accelerator's superconducting magnets and halting operations for a long period.
The thesis studies the safety of the beam extraction system and its impact on the operational life of the system in terms of the number of mission aborts (fail-safe modes). A stochastic discrete-event dynamic model of the system's failure process was derived from a detailed analysis of its architecture and of the failure modes and failure statistics of each component. The model was analysed under several operational scenarios, providing estimates of the safety and of the number of mission aborts over one year of operation. The analysis also assessed the effectiveness of the architectural solutions adopted to tolerate and prevent failures in the most critical components.
The results showed that the system meets the SIL3 requirements of the IEC 61508 safety standard and does not unduly interfere with normal machine operation. The study also includes an evaluation of the overall safety achieved by the machine protection system of which the LBDS is an integral part.
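A stochastic discrete-event failure model of the kind described above is often exercised by Monte Carlo simulation. The sketch below estimates the number of fail-safe mission aborts under assumed, illustrative failure rates and mission counts; none of the numbers are taken from the thesis.

```python
# Monte Carlo sketch of a fail-safe abort estimate: a detected component
# failure during a mission triggers a safe dump (mission abort).
# Rates and mission parameters are illustrative only.

import random

def simulate_aborts(rates_per_hour, mission_hours, missions, seed=42):
    """Estimate the number of aborted missions out of `missions`.

    Component lifetimes are independent and exponentially distributed;
    a mission aborts if any monitored component fails before it ends.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    aborts = 0
    for _ in range(missions):
        # earliest failure among the monitored components
        first_failure = min(rng.expovariate(r) for r in rates_per_hour)
        if first_failure < mission_hours:
            aborts += 1
    return aborts
```

A full dependability study would additionally distinguish detected (fail-safe) from undetected (unsafe) failure modes, since only the latter threaten the SIL target.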
Resilience of an embedded architecture using hardware redundancy
In the last decade, the dominance of general-purpose computing systems in the market has been overtaken by embedded systems, with billions of units manufactured every year. Embedded systems appear in contexts where continuous operation is of utmost importance and failure can have profound consequences.
Nowadays, radiation poses a serious threat to the reliable operation of safety-critical systems. Fault-avoidance techniques, such as radiation hardening, have been commonly used in space applications. However, these components are expensive, lag behind commercial components in performance, and do not provide 100% fault elimination. Without fault-tolerance mechanisms, many of these faults can become errors at the application or system level, which, in turn, can result in catastrophic failures.
In this work we study the concepts of fault tolerance and dependability and extend them by providing our own definition of resilience. We analyse the physics of radiation-induced faults, the damage mechanisms of particles, and the process that leads to computing failures. We provide extensive taxonomies of (1) existing fault-tolerance techniques and (2) the effects of radiation on state-of-the-art electronics, analysing and comparing their characteristics. We propose a detailed model of faults and provide a classification of the different types of faults at various levels. We introduce a fault-tolerance algorithm and define the system states and actions necessary to implement it. We introduce novel hardware and system-software techniques that provide a more efficient combination of reliability, performance, and power consumption than existing techniques. We propose a new element of the system, called the syndrome, that is the core of a resilient architecture whose software and hardware can adapt to reliable and unreliable environments. We implement a software simulator and disassembler and introduce a testing framework in combination with ERA's assembler and commercial hardware simulators.
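The basic masking mechanism that hardware redundancy provides can be shown with a majority voter. This is a software-level illustration of the idea only; a radiation-tolerant design such as the one described here votes in hardware, and the names below are not from the thesis.

```python
# Majority-voter sketch: run the same computation on three redundant
# "lanes" and vote, so a single faulty lane is masked.

from collections import Counter

def vote(results):
    """Return the majority value among replica results, or raise if no
    strict majority exists (an uncorrectable disagreement)."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: " + repr(results))
    return value

def tmr_execute(fn, x, lanes=3):
    """Execute fn(x) on `lanes` replicas and vote on the outputs.
    In software this masks faults only if the lanes can fail
    independently (e.g., separate cores or time-staggered runs)."""
    return vote([fn(x) for _ in range(lanes)])
```

The voter itself is a single point of failure, which is why hardened designs replicate or self-check the voter as well.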
Design for dependability: A simulation-based approach
This research addresses issues in simulation-based, system-level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system-level analysis are discussed, and a methodology that addresses and tackles them is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, and routing algorithms, as well as other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be pre-defined, as is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general-purpose design and fault-injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage, when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time-acceleration algorithms, and hybrid simulation to reduce simulation time is introduced.
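The core loop of simulation-based fault injection, comparing each faulty run against a fault-free "golden" run, can be sketched as follows. This is a toy model under assumed names; DEPEND's actual object-oriented simulation environment is far richer.

```python
# Toy fault-injection campaign: flip one random bit per trial in a
# simulated memory image and count the runs whose result diverges from
# the fault-free (golden) run.

import random

def inject_bit_flip(word, bit):
    """Flip one bit of a memory word (models a single-event upset)."""
    return word ^ (1 << bit)

def run_campaign(workload, memory, trials, seed=1):
    """Inject one random single-bit fault per trial; return the number
    of trials whose output differs from the golden result."""
    rng = random.Random(seed)
    golden = workload(list(memory))  # fault-free reference run
    failures = 0
    for _ in range(trials):
        faulty = list(memory)
        addr = rng.randrange(len(faulty))
        faulty[addr] = inject_bit_flip(faulty[addr], rng.randrange(32))
        if workload(faulty) != golden:
            failures += 1
    return failures
```

Real campaigns also classify the non-diverging runs (fault masked, latent error, detected-and-recovered), since those distinctions drive the dependability measures.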
DEPEND: A Simulation-Based Environment for System Level Dependability Analysis
Coordinated Science Laboratory (formerly known as Control Systems Laboratory). Supported by the National Aeronautics and Space Administration under NASA NAG-1-613 and NASA NGT-5083.
Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions
Tolerating hardware faults in modern architectures is becoming a prominent problem due to the miniaturization of hardware components, their increasing complexity, and the necessity to reduce costs. Software-Implemented Hardware Fault Tolerance approaches have been developed to improve system dependability against hardware faults without resorting to custom hardware solutions. However, these come at the expense of making the satisfaction of the timing constraints of the applications/activities harder from a scheduling standpoint. This paper surveys the current state of the art of fault-tolerance approaches when used in the context of real-time systems, identifying the main challenges and the cross-links between these two topics. We propose a joint scheduling-failure analysis model that highlights the formal interactions between software fault-tolerance mechanisms and timing properties. This model allows us to present and discuss many open research questions, with the final aim of spurring future research activities.
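The tension the paper studies, recovery time competing with deadlines, can be shown with the simplest software fault-tolerance mechanism, full task re-execution. The feasibility test below is an illustrative simplification (one task, no interference), not the paper's joint analysis model.

```python
# Re-execution vs. deadline: each recovery re-runs the whole task, so
# every tolerated fault consumes one WCET's worth of slack.

def feasible_with_reexecution(wcet, deadline, max_faults):
    """True if the task can absorb up to `max_faults` faults, each
    triggering a full re-execution, and still meet its deadline."""
    return (1 + max_faults) * wcet <= deadline

def tolerable_faults(wcet, deadline):
    """Largest number of re-executions the deadline slack can absorb."""
    return max(deadline // wcet - 1, 0)
```

Checkpointing shrinks the per-fault cost from a full WCET to one checkpoint interval plus overhead, which is exactly the kind of trade-off a joint scheduling-failure model has to capture.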
Computer aided reliability, availability, and safety modeling for fault-tolerant computer systems with commentary on the HARP program
Many of the most challenging reliability problems of the present decade involve complex distributed systems such as interconnected telephone switching computers, air traffic control centers, aircraft and space vehicles, and local- and wide-area computer networks. In addition to the challenge of complexity, modern fault-tolerant computer systems require very high levels of reliability, e.g., avionic computers with MTTF goals of one billion hours. Most analysts find it too difficult to model such complex systems without computer-aided design programs. In response to this need, NASA has developed a suite of computer-aided reliability modeling programs, beginning with CARE 3 and including a group of newer programs such as HARP, HARP-PC, the Reliability Analysts Workbench (a combination of the model solvers SURE, STEM, and PAWS with the common front-end model ASSIST), and the Fault Tree Compiler. The HARP program is studied, and how well the user can model systems with it is investigated. One of the important objectives is to study how user-friendly the program is, e.g., how easy it is to model the system, provide the input information, and interpret the results. The experiences of the author and his graduate students, who used HARP in two graduate courses, are described. Some brief comparisons are made with the ARIES program, which the students also used. Theoretical studies of the modeling techniques used in HARP are also included. Of course, no answer can be more accurate than the fidelity of the underlying model, so an appendix discussing modeling accuracy is included. A broad viewpoint is taken, and all problems that occurred in the use of HARP are discussed, including computer-system problems, installation-manual problems, user-manual problems, program inconsistencies, program limitations, confusing notation, long run times, and accuracy problems.
Fault-tolerant computer study
A set of building-block circuits is described which can be used with commercially available microprocessors and memories to implement fault-tolerant distributed computer systems. Each building-block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self-checking computer module with self-contained input/output and interfaces to redundant communication buses. Fault tolerance is achieved by connecting self-checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building-block circuits are discussed.
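The self-checking property described above, a module either delivers a checked result or signals its own failure rather than emitting bad data, can be sketched with duplicate-and-compare. The names below are illustrative; the actual modules are VLSI circuits with hardware comparators.

```python
# Duplicate-and-compare sketch of a self-checking module: run two
# independent copies of a computation and compare, turning a silent
# data error into an explicit fail-signal.

class SelfCheckError(Exception):
    """Raised when the duplicated lanes disagree (the fail-signal)."""

def self_checking(fn, fn_copy, x):
    """Run two copies of a computation on the same input and compare.
    Disagreement raises instead of returning a possibly-wrong value."""
    a, b = fn(x), fn_copy(x)
    if a != b:
        raise SelfCheckError("lane disagreement")
    return a
```

In the redundant network described, that fail-signal is what lets backup buses and spare modules take over: failures are detected locally instead of propagating as corrupt data.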