
    Multi-State System Reliability: A New and Systematic Review

    Reliability analysis that considers multiple possible states is known as multi-state (MS) reliability analysis. Multi-state system (MSS) reliability models allow both the system and its components to assume more than two levels of performance. Although multi-state reliability models provide more realistic and more precise representations of engineering systems, they are much more complex and present major difficulties in system definition and performance evaluation. MSS reliability has received a substantial amount of attention in the past four decades. This article presents a new and systematic review of multi-state system reliability. A timely review can effectively support the development of MSS theory. The paper summarizes the latest studies and advances in multi-state system reliability evaluation, multi-state system optimization, and multi-state system maintenance.
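As a small illustration of the multi-state idea in the abstract above (not taken from the reviewed paper), the following sketch computes the state distribution of a series system. It assumes independent components and assumes that the system performance equals the minimum of the component performance levels. The component names, levels, and probabilities are hypothetical.

```python
from collections import defaultdict
from itertools import product


def series_state_distribution(components):
    """Return {system level: probability} for a multi-state series system.

    components: list of dicts mapping performance level -> probability.
    Assumption: components are independent and the system level is the
    minimum of the component levels.
    """
    dist = defaultdict(float)
    # Enumerate every combination of component states.
    for combo in product(*(c.items() for c in components)):
        level = min(lvl for lvl, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        dist[level] += prob
    return dict(dist)


def reliability(dist, demand):
    """Probability that the system performance meets or exceeds the demand."""
    return sum(p for lvl, p in dist.items() if lvl >= demand)


# Hypothetical components: performance levels in percent of nominal capacity.
pump = {0: 0.10, 50: 0.30, 100: 0.60}
valve = {0: 0.05, 100: 0.95}

dist = series_state_distribution([pump, valve])
print(dist)
print(reliability(dist, demand=50))
```

Full enumeration grows exponentially with the number of components, so it is only practical for small systems.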

    A Condition-Based Maintenance Model for Assets with Accelerated Deterioration Due to Fault Propagation

    Complex industrial assets such as power transformers are subject to accelerated deterioration when one of their constituent components malfunctions and affects the condition of other components, a phenomenon called fault propagation. In this paper, we present a novel approach for optimizing condition-based maintenance policies for such assets by modelling their deterioration as a multiple dependent deterioration path process. The aim of the policy is to replace the malfunctioning component and mitigate accelerated deterioration at minimal impact to the business. The maintenance model provides guidance on determining inspection and maintenance strategies to optimize asset availability and operational cost. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/TR.2015.243913

    Selective maintenance for multistate series systems with S-dependent components

    In this paper, we consider the selective maintenance problem for multistate series systems with stochastically dependent components. In multistate systems, the health state of a component may vary from perfect functioning to complete failure. The stochastic dependence (S-dependence) between components is discussed and categorized into two types in the multistate context. First, the failure of a component can immediately cause complete failures of some other components in the system. Second, as components deteriorate, the reduced working performance rate of a multistate component affects the state as well as the degradation rate of its subsequent components in the series structure. The system reliability is evaluated using an approach based on stochastic processes. A cost-based selective maintenance model is developed for the multistate system with S-dependent components to maximize the total system profit, which includes the production gain and loss in the next mission as well as possible maintenance costs for the system. Analyses of systems with independent and dependent components are provided. It is observed that ignoring S-dependence in the system may lead to different maintenance decisions and an optimistic estimate of the system performance.

    Cross-layer Soft Error Analysis and Mitigation at Nanoscale Technologies

    This thesis addresses the challenge of soft error modeling and mitigation in nanoscale technology nodes and pushes the state of the art forward by proposing novel modeling, analysis, and mitigation techniques. The proposed soft error sensitivity analysis platform accurately models both error generation and propagation, starting from technology-dependent device-level simulations all the way to workload-dependent application-level analysis.

    Study of fault-tolerant software technology

    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance

    The implementation and use of Ada on distributed systems with reliability requirements

    The issues involved in the use of the programming language Ada on distributed systems are discussed. The effects of hardware failures, such as loss of a processor, on Ada programs are emphasized. It is shown that many Ada language elements are not well suited to this environment. Processor failure can easily lead to difficulties on those processors which remain. As an example, the calling task in a rendezvous may be suspended forever if the processor executing the serving task fails. A mechanism for detecting failure is proposed, and changes to the Ada run-time support system are suggested which avoid most of the difficulties. Ada program structures are defined which allow programs to reconfigure and continue to provide service following processor failure.
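The rendezvous hazard mentioned in the abstract above can be illustrated with a minimal sketch. This is not the mechanism proposed in the paper, and it is written in Python rather than Ada: it assumes a timeout on the caller side as the failure detector, and all names (`rendezvous_call`, `serving_task`, `ServerFailed`) are hypothetical.

```python
import queue
import threading


class ServerFailed(Exception):
    """Raised when the serving task does not answer within the timeout."""


def rendezvous_call(request_q, payload, timeout_s):
    """Send a request and wait for the reply, but never wait forever."""
    reply_q = queue.Queue(maxsize=1)
    request_q.put((payload, reply_q))
    try:
        return reply_q.get(timeout=timeout_s)
    except queue.Empty:
        # Assumption: silence beyond the timeout is treated as a failure.
        raise ServerFailed("no reply from serving task") from None


def serving_task(request_q, fail=False):
    """Accept one request; if fail is set, simulate processor loss."""
    payload, reply_q = request_q.get()
    if fail:
        return  # the reply is never sent
    reply_q.put(payload * 2)


# Healthy server: the call completes normally.
healthy = queue.Queue()
threading.Thread(target=serving_task, args=(healthy,), daemon=True).start()
result = rendezvous_call(healthy, 21, timeout_s=1.0)

# Failed server: the caller is released by the timeout instead of hanging.
dead = queue.Queue()
threading.Thread(target=serving_task, args=(dead, True), daemon=True).start()
try:
    rendezvous_call(dead, 21, timeout_s=0.2)
    detected = False
except ServerFailed:
    detected = True

print(result, detected)
```

A timeout alone cannot distinguish a failed server from a slow one, so a real detector would need additional evidence before reconfiguring.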

    Design of a fault tolerant airborne digital computer. Volume 2: Computational requirements and technology

    This final report summarizes the work on the design of a fault tolerant digital computer for aircraft. Volume 2 is composed of two parts. Part 1 is concerned with the computational requirements associated with an advanced commercial aircraft. Part 2 reviews the technology that will be available for the implementation of the computer in the 1975-1985 period. With regard to the computation task, 26 computations have been categorized according to computational load, memory requirements, criticality, permitted down-time, and the need to save data in order to effect a roll-back. The technology part stresses the impact of large scale integration (LSI) on the realization of logic and memory. Also considered were module interconnection possibilities that minimize fault propagation.

    Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications

    As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with failures occurring on the order of hours or days apart. Furthermore, some studies of future Exascale systems predict failures on the order of minutes apart. As a result, efficient fault tolerance solutions are needed to tolerate frequent failures. A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient, and highly scalable. It should have low overhead in fault-free execution and provide fast restart, because long-running applications are expected to experience many faults during execution. Meanwhile, task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adoption of task-based dataflow parallelism in OpenMP 4.0, OmpSs PM, Argobots, and Intel Threading Building Blocks. In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, we first design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications. Taken together, our schemes provide fairly high failure coverage, in which both computation and memory are protected against errors. Postprint (published version)
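The selective task replication idea named in the abstract above can be sketched as follows. This is an illustrative assumption, not the thesis's implementation: a task flagged for replication runs twice, and a third run with majority voting is used only when the first two results disagree. The function names and the fault injection are hypothetical, and results are assumed to be hashable and directly comparable.

```python
from collections import Counter


def run_task(task, args, replicate):
    """Run a task, optionally with replication to catch silent corruption."""
    if not replicate:
        return task(*args)  # unprotected: a corrupted result goes unnoticed
    first, second = task(*args), task(*args)
    if first == second:
        return first  # replicas agree, accept the result
    # Disagreement detected: run a third replica and take the majority.
    third = task(*args)
    winner, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("all replicas disagree; cannot recover")
    return winner


def make_flaky_sum():
    """Build a sum task whose first execution returns a corrupted value."""
    calls = {"n": 0}

    def flaky_sum(xs):
        calls["n"] += 1
        total = sum(xs)
        # Injected fault: only the first call is off by one.
        return total + 1 if calls["n"] == 1 else total

    return flaky_sum


data = (1, 2, 3)
protected = run_task(make_flaky_sum(), (data,), replicate=True)
unprotected = run_task(make_flaky_sum(), (data,), replicate=False)
print(protected, unprotected)
```

Replicating only selected tasks trades coverage for overhead: each protected task costs at least two executions, so the choice of which tasks to protect determines the cost of the scheme.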