39 research outputs found

    General solutions for MTTF and steady-state availability of NMR systems

    Voting redundancy is a well-known technique for improving the fault tolerance of digital systems. Calculating reliability and availability is a necessary part of every fault-tolerant system design. Conventional techniques, including the block diagram method, fail when the system exhibits dependent failures, repair, or standby operation. Markov models also become intractable when the system is composed of many elements or is repairable. However, the alternatives to reliability and availability, namely mean time to failure and steady-state availability, respectively, are easily obtained using Markov graphs. In this paper, novel general-form solutions for calculating the mean time to failure and steady-state availability of N-modular redundancy (NMR) systems are presented.
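The kind of Markov-graph calculation that the paper generalizes can be sketched for the simplest NMR case, triple modular redundancy (TMR) without repair. The failure rate below is a placeholder, and the closed form 5/(6λ) is the standard textbook result for this special case, not the paper's general solution:

```python
import numpy as np

# Transient-state generator matrix Q for a TMR (N=3) system with per-module
# failure rate lam and no repair.  Transient states: "3 good", "2 good";
# the system fails (is absorbed) once fewer than 2 modules work.
lam = 1e-4  # hypothetical per-module failure rate, failures per hour

Q = np.array([
    [-3 * lam, 3 * lam],   # from "3 good": any of 3 modules may fail
    [0.0,     -2 * lam],   # from "2 good": any of 2 fails -> system down
])

# Mean time to absorption from each transient state: solve -Q t = 1
t = np.linalg.solve(-Q, np.ones(2))
mttf = t[0]  # starting with all 3 modules good

print(mttf)           # equals 1/(3*lam) + 1/(2*lam) = 5/(6*lam)
print(5 / (6 * lam))  # closed-form check
```

The same matrix recipe scales to larger N and to repairable systems (by adding repair rates below the diagonal), which is why MTTF is tractable from a Markov graph even when the full reliability function is not.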

    Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks

    Technology scaling, accompanied by higher operating frequencies and the ability to integrate more functionality in the same chip, has been the driving force behind delivering higher-performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as a Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running on network-on-chip based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote-memory-access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping, and (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at the system level, in particular by exploiting the redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design-time adoption) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on lifetime reliability.
We propose two recovery schemes based on a checkpoint-and-rollback and a roll-forward technique. For the latter, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that they can be achieved without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library, or a hardware core to be added to the basic architecture.
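A run-time remapping heuristic of the kind mentioned above might look like the following greedy sketch. The task set, costs, and least-loaded policy are illustrative assumptions, not the thesis's actual algorithm:

```python
# Greedy fault-aware task remapping sketch: when a core suffers a permanent
# fault, move its tasks onto the least-loaded surviving cores, heaviest first.
# Task names, costs, and the policy itself are illustrative assumptions.

def remap_on_failure(mapping, cost, failed_core, cores):
    """Reassign each task mapped to failed_core to the currently
    least-loaded surviving core, heaviest tasks first."""
    load = {c: 0.0 for c in cores if c != failed_core}
    for task, core in mapping.items():
        if core != failed_core:
            load[core] += cost[task]
    new_mapping = dict(mapping)
    stranded = [t for t, c in mapping.items() if c == failed_core]
    for task in sorted(stranded, key=lambda t: -cost[t]):
        target = min(load, key=load.get)   # greedy: least-loaded survivor
        new_mapping[task] = target
        load[target] += cost[task]
    return new_mapping

mapping = {"t1": 0, "t2": 0, "t3": 1, "t4": 2}        # task -> core
cost = {"t1": 2.0, "t2": 1.0, "t3": 3.0, "t4": 1.0}   # execution load
new_mapping = remap_on_failure(mapping, cost, failed_core=0, cores=[0, 1, 2])
print(new_mapping)   # t1 and t2 have moved off the failed core 0
```

An ILP formulation would find the load-optimal remapping offline; a greedy pass like this trades optimality for a run-time cost low enough to execute on the platform itself after a fault is detected.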

    Addressing Complexity and Intelligence in Systems Dependability Evaluation

    Engineering and computing systems are increasingly complex, intelligent, and open adaptive. When it comes to the dependability evaluation of such systems, there are certain challenges posed by the characteristics of “complexity” and “intelligence”. The first aspect of complexity is the dependability modelling of large systems with many interconnected components and dynamic behaviours such as Priority, Sequencing and Repairs. To address this, the thesis proposes a novel hierarchical solution to dynamic fault tree analysis using Semi-Markov Processes. A second aspect of complexity is the environmental conditions that may impact dependability and their modelling. For instance, weather and logistics can influence maintenance actions and hence dependability of an offshore wind farm. The thesis proposes a semi-Markov-based maintenance model called “Butterfly Maintenance Model (BMM)” to model this complexity and accommodate it in dependability evaluation. A third aspect of complexity is the open nature of systems of systems, like swarms of drones, which makes complete design-time dependability analysis infeasible. To address this aspect, the thesis proposes a dynamic dependability evaluation method using Fault Trees and Markov models at runtime. The challenge of “intelligence” arises because Machine Learning (ML) components do not exhibit programmed behaviour; their behaviour is learned from data. However, in traditional dependability analysis, systems are assumed to be programmed or designed. When a system has learned from data, then a distributional shift of operational data from training data may cause ML to behave incorrectly, e.g., misclassify objects. To address this, a new approach called SafeML is developed that uses statistical distance measures for monitoring the performance of ML against such distributional shifts.
The thesis develops the proposed models and evaluates them on case studies, highlighting improvements over the state of the art, limitations, and future work.
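The SafeML idea of monitoring distributional shift with statistical distance measures can be sketched as follows. The Kolmogorov-Smirnov statistic and the alarm threshold are illustrative choices, not necessarily the exact measures used in the thesis:

```python
import numpy as np

# Compare training-time and operational feature distributions with a
# statistical distance (the empirical two-sample Kolmogorov-Smirnov
# statistic) and raise an alarm when it exceeds a threshold.
# The threshold value is an assumption for illustration only.

def ks_statistic(a, b):
    """Empirical two-sample KS distance: max gap between the two ECDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / len(a)
    ecdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)       # training-time feature samples
ops_ok = rng.normal(0.0, 1.0, 5000)      # operation matches training
ops_shift = rng.normal(1.5, 1.0, 5000)   # distributional shift

THRESHOLD = 0.1  # illustrative alarm level
print(ks_statistic(train, ops_ok) < THRESHOLD)     # no alarm expected
print(ks_statistic(train, ops_shift) > THRESHOLD)  # shift should be flagged
```

A monitor like this needs no access to the ML model's internals, only to the feature samples, which is what makes the approach usable at runtime.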

    Dependability analysis of a safety critical system: the LHC Beam Dumping System at CERN

    The beam extraction system of CERN's new LHC accelerator (LHC Beam Dumping System, LBDS) is responsible for removing the particle beam from the ring in case of anomalies or machine faults, or at the end of an operation. The system is one of the safety-critical components of the LHC accelerator. Its malfunction can lead to a missed or partial extraction of the beam which, at the very high energies reached (7 TeV), is capable of destroying the accelerator's superconducting magnets and halting operations for a long period. The thesis studies the safety of the beam extraction system and its impact on the operational life of the system in terms of the number of mission aborts (failsafe modes). A stochastic discrete-event dynamic model of the system's failure process was derived, starting from an accurate analysis of its architecture and of the failure modes and failure statistics of each component. The model was analysed under several operational scenarios, providing estimates of safety and of the number of mission aborts over one year of operation. The analysis also evaluated the effectiveness of the architectural solutions adopted to tolerate and prevent failures in the most critical components. The results obtained showed that the system meets the SIL3 requirements of the IEC 61508 safety standard and does not interfere unduly with the normal operation of the machine. The study also includes an assessment of the overall safety achieved by the protection system of which the LBDS is an integral part.
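The kind of estimate reported above (mission aborts per year from a stochastic model) can be illustrated with a toy Monte Carlo sketch. Every rate and count below is an invented placeholder, not an LBDS figure:

```python
import random

# Toy stochastic model: a mission aborts (failsafe) if any of several
# independent surveillance channels spuriously trips during the mission.
# Channel count, mission count, and trip probability are all placeholders.

def aborts_per_year(n_missions, n_channels, p_trip, seed=42):
    """Simulate one year of operation; count failsafe mission aborts."""
    rng = random.Random(seed)
    aborts = 0
    for _ in range(n_missions):
        if any(rng.random() < p_trip for _ in range(n_channels)):
            aborts += 1
    return aborts

est = aborts_per_year(n_missions=400, n_channels=14, p_trip=0.001)
# Expected abort fraction: 1 - (1 - 0.001)**14, roughly 1.4% of missions
print(est)
```

A real analysis of this kind would replace the flat trip probability with per-component failure-mode statistics and a discrete-event model of the system architecture, as the thesis describes.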

    Resilience of an embedded architecture using hardware redundancy

    In the last decade, the dominance of the general-purpose computing systems market has been overtaken by embedded systems, with billions of units manufactured every year. Embedded systems appear in contexts where continuous operation is of utmost importance and failure can be profound. Nowadays, radiation poses a serious threat to the reliable operation of safety-critical systems. Fault-avoidance techniques, such as radiation hardening, have been commonly used in space applications. However, these components are expensive, lag behind commercial components in performance, and do not provide 100% fault elimination. Without fault-tolerance mechanisms, many of these faults can become errors at the application or system level, which, in turn, can result in catastrophic failures. In this work we study the concepts of fault tolerance and dependability and extend them, providing our own definition of resilience. We analyse the physics of radiation-induced faults, the damage mechanisms of particles, and the process that leads to computing failures. We provide extensive taxonomies of (1) existing fault-tolerance techniques and (2) the effects of radiation on state-of-the-art electronics, analysing and comparing their characteristics. We propose a detailed fault model and provide a classification of the different types of faults at various levels. We introduce a fault-tolerance algorithm and define the system states and actions necessary to implement it. We introduce novel hardware and system-software techniques that provide a more efficient combination of reliability, performance, and power consumption than existing techniques. We propose a new element of the system, called the syndrome, that is the core of a resilient architecture whose software and hardware can adapt to reliable and unreliable environments. We implement a software simulator and disassembler and introduce a testing framework in combination with ERA’s assembler and commercial hardware simulators.
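The redundancy-based masking discussed above can be illustrated in software with a generic triple-execution-and-vote sketch. This is plain TMR for transient faults, not the thesis's syndrome architecture:

```python
from collections import Counter

# Run a computation three times and take a majority vote, masking a single
# transient (e.g. radiation-induced) fault.  Generic illustration only.

def tmr(fn, *args):
    """Execute fn three times and return the majority result.
    Raises if all three runs disagree (uncorrectable fault)."""
    results = [fn(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: uncorrectable fault")
    return value

# Simulated faulty computation: flips one bit on its second call only,
# mimicking a single-event upset during one of the redundant executions.
calls = {"n": 0}
def add_with_transient_fault(a, b):
    calls["n"] += 1
    result = a + b
    if calls["n"] == 2:      # inject the upset into the second run
        result ^= 0x4
    return result

print(tmr(add_with_transient_fault, 10, 5))  # 15: the fault is masked
```

Time redundancy like this masks transients at the cost of roughly 3x execution time; hardware TMR spends area instead, which is the reliability/performance/power trade-off the thesis explores.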

    Design for dependability: A simulation-based approach

    This research addresses issues in simulation-based, system-level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system-level analysis are discussed, and a methodology that tackles these issues is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, routing algorithms, and other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be pre-defined, as is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general-purpose design and fault-injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage, when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time-acceleration algorithms, and hybrid simulation to reduce simulation time is introduced.
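The key benefit described above, failure modes discovered by injection rather than pre-defined, can be sketched with a toy campaign. The miniature "application" and the campaign parameters are illustrative assumptions, not DEPEND's API:

```python
import random

# Run the application model many times, flip a random bit of its input state
# each time, and classify what actually happens instead of assuming it.

def app(words):
    return max(words)            # stand-in for the simulated application

def injection_campaign(words, trials=1000, seed=1):
    """Inject one random single-bit flip per trial and classify the outcome."""
    rng = random.Random(seed)
    golden = app(words)          # fault-free reference result
    outcomes = {"masked": 0, "sdc": 0}
    for _ in range(trials):
        faulty = list(words)
        i = rng.randrange(len(faulty))
        faulty[i] ^= 1 << rng.randrange(16)   # inject a single bit flip
        if app(faulty) == golden:
            outcomes["masked"] += 1           # fault had no visible effect
        else:
            outcomes["sdc"] += 1              # silent data corruption
    return outcomes

print(injection_campaign([7, 42, 99, 1000]))
```

Note that the masked/corrupted split falls out of the simulation itself: nothing about the application's failure behavior was specified in advance.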

    DEPEND: A Simulation-Based Environment for System Level Dependability Analysis

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory.
    National Aeronautics and Space Administration / NASA NAG-1-613 and NASA NGT-5083

    Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions

    Tolerating hardware faults in modern architectures is becoming a prominent problem due to the miniaturization of hardware components, their increasing complexity, and the necessity of reducing costs. Software-Implemented Hardware Fault Tolerance approaches have been developed to improve system dependability against hardware faults without resorting to custom hardware solutions. However, these come at the expense of making the satisfaction of the timing constraints of the applications/activities harder from a scheduling standpoint. This paper surveys the current state of the art of fault tolerance approaches used in the context of real-time systems, identifying the main challenges and the cross-links between these two topics. We propose a joint scheduling-failure analysis model that highlights the formal interactions among software fault tolerance mechanisms and timing properties. This model allows us to present and discuss many open research questions, with the final aim of spurring future research activities.
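One concrete instance of such a joint scheduling-failure analysis is classic fixed-priority response-time analysis extended with a re-execution recovery cost. The task set and the single-fault recovery model below are illustrative assumptions, not the paper's model:

```python
import math

# Fixed-priority response-time analysis with fault tolerance: each tolerated
# fault may force re-execution of the longest job at this or higher priority,
# which is added to the classic interference fixed-point iteration.

def response_time(tasks, i, faults=1):
    """tasks: list of (C, T) in priority order (T = period = deadline).
    Returns the worst-case response time of task i under `faults`
    re-executions, or None if its deadline is missed."""
    C_i, T_i = tasks[i]
    recovery = faults * max(c for c, _ in tasks[: i + 1])
    R = C_i + recovery
    while True:
        interference = sum(math.ceil(R / T_j) * C_j for C_j, T_j in tasks[:i])
        R_next = C_i + recovery + interference
        if R_next > T_i:
            return None           # deadline miss: not schedulable
        if R_next == R:
            return R              # fixed point reached
        R = R_next

tasks = [(1, 5), (2, 10), (3, 20)]    # (C, T), rate-monotonic priority order
print([response_time(tasks, i) for i in range(3)])  # [2, 5, 10]
```

The recovery term is exactly where the scheduling and failure models interact: a larger tolerated fault count inflates every response time, so fault-tolerance level and schedulability must be analyzed jointly rather than separately.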

    Computer aided reliability, availability, and safety modeling for fault-tolerant computer systems with commentary on the HARP program

    Many of the most challenging reliability problems of our present decade involve complex distributed systems such as interconnected telephone switching computers, air traffic control centers, aircraft and space vehicles, and local area and wide area computer networks. In addition to the challenge of complexity, modern fault-tolerant computer systems require very high levels of reliability, e.g., avionic computers with MTTF goals of one billion hours. Most analysts find it too difficult to model such complex systems without computer-aided design programs. In response to this need, NASA has developed a suite of computer-aided reliability modeling programs, beginning with CARE 3 and including a group of new programs such as HARP, HARP-PC, the Reliability Analysts Workbench (a combination of the model solvers SURE, STEM, and PAWS with the common front-end model ASSIST), and the Fault Tree Compiler. The HARP program is studied, and how well the user can model systems with it is investigated. One of the important objectives is to study how user-friendly this program is, e.g., how easy it is to model the system, provide the input information, and interpret the results. The experiences of the author and his graduate students, who used HARP in two graduate courses, are described. Some brief comparisons were made with the ARIES program, which the students also used. Theoretical studies of the modeling techniques used in HARP are also included. Of course, no answer can be more accurate than the fidelity of the model; thus an Appendix is included which discusses modeling accuracy. A broad viewpoint is taken, and all problems which occurred in the use of HARP are discussed. Such problems include computer system problems, installation manual problems, user manual problems, program inconsistencies, program limitations, confusing notation, long run times, accuracy problems, etc.

    Fault-tolerant computer study

    A set of building-block circuits is described which can be used with commercially available microprocessors and memories to implement fault-tolerant distributed computer systems. Each building-block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self-checking computer module with self-contained input/output and interfaces to redundant communication buses. Fault tolerance is achieved by connecting self-checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building-block circuits are discussed.
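The self-checking computer module concept can be illustrated with a duplicate-and-compare sketch, a generic software analogue rather than the actual building-block circuits:

```python
# Self-checking by duplication and comparison: two replicas compute
# independently and the module signals an internal fault on any mismatch,
# so the surrounding redundant network can switch it out.

def self_checking_pair(fn_a, fn_b, *args):
    """Run two replicas; return (value, ok).  ok=False means the module
    detected an internal fault."""
    a, b = fn_a(*args), fn_b(*args)
    return (a, a == b)

square = lambda x: x * x
faulty_square = lambda x: x * x + (1 if x == 3 else 0)  # injected defect

print(self_checking_pair(square, square, 3))         # (9, True)
print(self_checking_pair(square, faulty_square, 3))  # (9, False)
```

Duplicate-and-compare detects faults but cannot mask them on its own; masking comes from the redundant network of modules and backup buses described above, which replaces a module once it reports a mismatch.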