1,701 research outputs found

    Reliability analysis of triple modular redundancy system with spare

    Get PDF
    Hardware redundant fault-tolerant systems and the different design approaches are discussed. The reliability analysis of fault-tolerant systems is usually done under permanent fault conditions. With statistical data suggesting that up to 90% of system failures are caused by intermittent faults, the reliability analysis of fault-tolerant systems must concentrate more on this class of faults. In this work, a reconfigurable Triple Modular Redundancy (TMR) with spare system that differentiates between permanent and intermittent faults has been built. The reconfiguration process of this system depends on both the current status of its modules and their history. Based on this, a different approach for reliability analysis under intermittent fault conditions using Markov models is presented. This approach shows a much higher system reliability compared to other redundant and non-redundant configurations

    On Fault Tolerance Methods for Networks-on-Chip

    Get PDF
    Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for network-onchip (NoC) which is a design paradigm targeted for very large systems-onchip. In a NoC resources, such as processors and memories, are connected to a communication network; comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains the classification to transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method especially against transient faults at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors and their combination to error control coding is illustrated. At the network layer positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in the both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levelsSiirretty Doriast

    Towards automatic Markov reliability modeling of computer architectures

    Get PDF
    The analysis and evaluation of reliability measures using time-varying Markov models is required for Processor-Memory-Switch (PMS) structures that have competing processes such as standby redundancy and repair, or renewal processes such as transient or intermittent faults. The task of generating these models is tedious and prone to human error due to the large number of states and transitions involved in any reasonable system. Therefore model formulation is a major analysis bottleneck, and model verification is a major validation problem. The general unfamiliarity of computer architects with Markov modeling techniques further increases the necessity of automating the model formulation. This paper presents an overview of the Automated Reliability Modeling (ARM) program, under development at NASA Langley Research Center. ARM will accept as input a description of the PMS interconnection graph, the behavior of the PMS components, the fault-tolerant strategies, and the operational requirements. The output of ARM will be the reliability of availability Markov model formulated for direct use by evaluation programs. The advantages of such an approach are (a) utility to a large class of users, not necessarily expert in reliability analysis, and (b) a lower probability of human error in the computation

    Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology

    Get PDF
    An investigation was made in AIRLAB of the fault handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve: (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replacement unit (LRU) is mistakenly disabled by the fault management software

    Advanced reliability modeling of fault-tolerant computer-based systems

    Get PDF
    Two methodologies for the reliability assessment of fault tolerant digital computer based systems are discussed. The computer-aided reliability estimation 3 (CARE 3) and gate logic software simulation (GLOSS) are assessment technologies that were developed to mitigate a serious weakness in the design and evaluation process of ultrareliable digital systems. The weak link is based on the unavailability of a sufficiently powerful modeling technique for comparing the stochastic attributes of one system against others. Some of the more interesting attributes are reliability, system survival, safety, and mission success

    Improvement of the Fault Tolerance in IoT Based Positioning Systems by Applying for Redundancy in the Controller Layer

    Get PDF
                   في السنوات الأخيرة ، ازداد انتشار تطبيقات تحديد المواقع للأنظمة القائمة على إنترنت الأشياء (IoT) بشكل متزايد ، ووجدت استخدامات في تتبع الأنشطة اليومية للأطفال والمسنين وتتبع المركبات. من وجهة نظر واحدة ، قد تحتوي البيانات التي تم الحصول عليها من الأنظمة القائمة على نظام تحديد المواقع العالمي (GPS) على خطأ ، مع مراعاة هذه العوامل ، تعتمد الطريقة المقترحة لهذه الدراسة على تطبيق تحديد المواقع القائم على إنترنت الأشياء واستبدال استخدام إنترنت الأشياء بدلاً من نظام تحديد المواقع العالمي (GPS). ومع ذلك ، لا يمكن أن يكون هذا سببًا لعدم استخدام نظام تحديد المواقع العالمي (GPS) ، ولتعزيز الموثوقية ، يمكننا تطبيق مجموعة متوازية من النظام الحديث والأساليب التقليدية في وقت واحد. على الرغم من أنه يمكن الوصول إلى إشارات نظام تحديد المواقع العالمي (GPS) فقط في الأماكن المفتوحة ، فإن أجهزة نظام تحديد المواقع العالمي (GPS) معرضة للخطأ في المقام الأول عندما يكون جهاز الاستقبال موجودًا في منطقة مدنية ، بسبب الازدحام والتداخل المحتمل. تقدم النتيجة نموذجًا قائمًا على التكرار لتحسين تحمل الأخطاء لأنظمة تحديد المواقع القائمة على إنترنت الأشياء. تُظهر نتائج المحاكاة تحسنًا بنسبة 22.5٪ في تحمل الأخطاء لنظام تحديد المواقع القائم على إنترنت الأشياء بعد تطبيق آلية التحقق المقترحة وتحسين 77.4٪ في هذا التسامح بعد التقدم للحصول على تكرار أكثر تكلفة للوحدة النمطية.  In recent years, the positioning applications of Internet-of-Things (IoT) based systems have grown increasingly popular, and are found to be useful in tracking the daily activities of children, the elderly and vehicle tracking. It can be argued that the data obtained from GPS based systems may contain error, hence taking these factors into account, the proposed method for this study is based on the application of IoT-based positioning and the replacement of using IoT instead of GPS.  This cannot, however, be a reason for not using the GPS, and in order to enhance the reliability, a parallel combination of the modern system and traditional methods simultaneously can be applied. Although GPS signals can only be accessed in open spaces, GPS devices are error-prone primarily when the receiver is located in an urban-canyons area, due to congestion and the possible interference. The outcome presents a redundancy-based model for improving the fault tolerance of IoT-based positioning systems. The simulation results show a 22.5% improvement in the fault tolerance of the IoT-based positioning system after applying the proposed validation mechanism, and a 77.4% improvement in this tolerance after applying for a more expensive module redundancy

    Intermittent/transient fault phenomena in digital systems

    Get PDF
    An overview of the intermittent/transient (IT) fault study is presented. An interval survivability evaluation of digital systems for IT faults is discussed along with a method for detecting and diagnosing IT faults in digital systems

    Reliability-aware and energy-efficient system level design for networks-on-chip

    Get PDF
    2015 Spring.Includes bibliographical references.With CMOS technology aggressively scaling into the ultra-deep sub-micron (UDSM) regime and application complexity growing rapidly in recent years, processors today are being driven to integrate multiple cores on a chip. Such chip multiprocessor (CMP) architectures offer unprecedented levels of computing performance for highly parallel emerging applications in the era of digital convergence. However, a major challenge facing the designers of these emerging multicore architectures is the increased likelihood of failure due to the rise in transient, permanent, and intermittent faults caused by a variety of factors that are becoming more and more prevalent with technology scaling. On-chip interconnect architectures are particularly susceptible to faults that can corrupt transmitted data or prevent it from reaching its destination. Reliability concerns in UDSM nodes have in part contributed to the shift from traditional bus-based communication fabrics to network-on-chip (NoC) architectures that provide better scalability, performance, and utilization than buses. In this thesis, to overcome potential faults in NoCs, my research began by exploring fault-tolerant routing algorithms. Under the constraint of deadlock freedom, we make use of the inherent redundancy in NoCs due to multiple paths between packet sources and sinks and propose different fault-tolerant routing schemes to achieve much better fault tolerance capabilities than possible with traditional routing schemes. The proposed schemes also use replication opportunistically to optimize the balance between energy overhead and arrival rate. As 3D integrated circuit (3D-IC) technology with wafer-to-wafer bonding has been recently proposed as a promising candidate for future CMPs, we also propose a fault-tolerant routing scheme for 3D NoCs which outperforms the existing popular routing schemes in terms of energy consumption, performance and reliability. To quantify reliability and provide different levels of intelligent protection, for the first time, we propose the network vulnerability factor (NVF) metric to characterize the vulnerability of NoC components to faults. NVF determines the probabilities that faults in NoC components manifest as errors in the final program output of the CMP system. With NVF aware partial protection for NoC components, almost 50% energy cost can be saved compared to the traditional approach of comprehensively protecting all NoC components. Lastly, we focus on the problem of fault-tolerant NoC design, that involves many NP-hard sub-problems such as core mapping, fault-tolerant routing, and fault-tolerant router configuration. We propose a novel design-time (RESYN) and a hybrid design and runtime (HEFT) synthesis framework to trade-off energy consumption and reliability in the NoC fabric at the system level for CMPs. Together, our research in fault-tolerant NoC routing, reliability modeling, and reliability aware NoC synthesis substantially enhances NoC reliability and energy-efficiency beyond what is possible with traditional approaches and state-of-the-art strategies from prior work
    corecore