350 research outputs found

    Study of Single Event Transient Error Mitigation

    Single Event Transient (SET) errors in ground-level electronic devices are a growing concern in the radiation-hardening field. However, few SET mitigation technologies meet ground-level demands for generality, flexibility, efficiency, and speed. Classic Triple Modular Redundancy (TMR) is the best-known and most popular technique in space and nuclear environments, but it incurs more than 200% area and power overhead, which is too costly for ground-level applications. Coding techniques are widely used to suppress upset errors in storage cells, but the irregularity of combinational logic limits their use in SET mitigation. SET mitigation techniques suited to ground-level applications are therefore needed. To meet this demand, this thesis proposes two novel approaches based on redundant wires and approximate logic. Redundant Wire is a SET mitigation technique that selectively adds redundant wire connections to stop targeted transient faults from propagating on the fly. The thesis proposes a set of signature-based evaluation equations that efficiently estimate the protection provided by each redundant-wire candidate. Based on these estimates, a greedy algorithm repeatedly inserts the best candidate. Simulation results show that the evaluation equations achieve up to 98% accuracy on average. The technique masks 18.4% of faults with overheads of 4.3% in area, 4.4% in power, and 5.4% in delay on average. Overall, the quality of the protection results is 2.8 times better than in previous work. The impact of synthesis constraints and signature length is also discussed. Approximate Logic is a partial TMR technique that trades fault coverage against area overhead. It consists of an under-approximate logic and an over-approximate logic: the under-approximate logic is a subset of the original minterms, and the over-approximate logic is a subset of the original maxterms. This thesis proposes a new algorithm for generating the two approximate logics. During generation, the algorithm considers the intrinsic failure probability of each gate and uses a confidence-interval estimate to minimize the required computation. The technique is applied to two fault models, stuck-at and SET, and the separate results are compared and discussed. The results show that the technique can reduce errors by 75% with an area penalty of 46% on some circuits. Its delay overhead is always two additional layers of logic. Both proposed SET mitigation techniques are applicable to generic combinational logic and are highly flexible. Simulation shows promising SET mitigation ability. The proposed techniques give designers more choices for developing reliable combinational logic in ground-level applications.
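
    The approximate-logic idea above can be illustrated with a small sketch. The functions `g`, `f_under`, and `h_over` are hypothetical examples chosen for illustration, not circuits from the thesis, and the majority vote over the three blocks is an assumed way of combining them. The sketch counts the input vectors on which a flipped output of the original logic is still masked.

```python
from itertools import product

def g(a, b, c):
    # Original logic (hypothetical example function).
    return (a and b) or c

def f_under(a, b, c):
    # Under-approximation: its on-set is a subset of g's minterms.
    return c

def h_over(a, b, c):
    # Over-approximation: its off-set is a subset of g's maxterms.
    return a or c

def majority(x, y, z):
    return (x and y) or (x and z) or (y and z)

def masked_fraction():
    """Fraction of input vectors on which a flipped output of g is outvoted."""
    vectors = list(product([False, True], repeat=3))
    masked = 0
    for v in vectors:
        good = g(*v)
        # Sanity checks: f_under implies g, and g implies h_over.
        assert (not f_under(*v)) or good
        assert (not good) or h_over(*v)
        faulty = not good  # transient fault flips the output of g
        if majority(f_under(*v), faulty, h_over(*v)) == good:
            masked += 1
    return masked / len(vectors)

print(masked_fraction())
```

    In this toy case the fault is masked whenever the two approximations agree, which is why cheaper approximations cover fewer input vectors.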

    Dependable Computing on Inexact Hardware through Anomaly Detection.

    Transistor reliability is declining as transistors continue to shrink, and aggressive voltage scaling makes the problem worse. Scaled-down transistors are more susceptible to transient faults as well as permanent in-field hardware failures. To continue reaping the benefits of technology scaling, it has become imperative to tackle the challenges that decreasing device reliability poses for the mainstream commodity market. At the same time, scaling yields diminishing marginal returns in energy efficiency and performance. More than at any other time in its history, the semiconductor industry stands at the crossroads of unreliability and the need to improve energy efficiency. These challenges can be tackled by dividing target applications into two categories: traditional applications, which have relatively strict correctness requirements on outputs, and an emerging class of soft applications from domains such as multimedia, machine learning, and computer vision, which inherently tolerate inaccuracy to a certain degree. Traditional applications can be protected against hardware failures by low-cost detection and protection methods, while soft applications can trade output quality for better performance or energy efficiency. For traditional applications, I propose an efficient, software-only application analysis and transformation solution to detect data-flow and control-flow transient faults. The intelligence of the data-flow solution lies in its use of dynamic application information such as control flow, memory, and value profiling. The control-flow protection technique achieves its efficiency by simplifying signature calculations in each basic block and by checking at a coarse-grained level. For soft applications, I develop a quality control technique that employs continuous, lightweight checkers to ensure that the approximation is controlled and the application output is acceptable. Overall, I show that using low-cost checkers to produce dependable results on commodity systems constructed from inexact hardware components is efficient and practical.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113341/1/dskhudia_1.pd
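
    As a rough illustration of signature-based control-flow checking with a single coarse-grained check, the sketch below uses a generic XOR signature scheme. It is not the dissertation's algorithm; the block names, signature values, and the straight-line control-flow graph are all assumptions made for the example.

```python
# Hypothetical static signatures for four basic blocks in a straight-line CFG.
SIG = {"A": 0b0001, "B": 0b0010, "C": 0b0100, "D": 0b1000}
# Single legal predecessor of each block (assumption: A -> B -> C -> D).
PRED = {"B": "A", "C": "B", "D": "C"}

def run(trace):
    """Replay a block trace; return True if the final signature check passes."""
    g = SIG[trace[0]]  # runtime signature register
    for block in trace[1:]:
        # One XOR per block: on a legal edge this turns SIG[pred] into SIG[block].
        g ^= SIG[PRED[block]] ^ SIG[block]
    # Coarse-grained check: compare only once, at the end of the region.
    return g == SIG[trace[-1]]

print(run(["A", "B", "C", "D"]))  # legal path
print(run(["A", "C", "D"]))       # illegal jump that skips block B
```

    Checking once per region instead of once per block lowers overhead, at the cost of detecting an illegal jump later than a per-block check would.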

    FireNN: Neural Networks Reliability Evaluation on Hybrid Platforms

    The growing complexity of neural networks has led to the adoption of hardware accelerators to supply the computational power that new architectures require. The ability to adapt a network to different platforms has increased the interest of safety-critical applications. Reliability evaluation of neural networks is still immature and requires platforms that can measure the safety standards demanded by mission-critical applications. For this reason, interest in studying the reliability of neural networks is growing. We propose a new approach for evaluating the resiliency of neural networks using hybrid platforms. The approach relies on reconfigurable hardware to emulate the target hardware platform and to perform fault injection. Its main advantage is that it includes the on-hardware execution of the neural network in the reliability analysis without any intrusion into the network algorithm, while addressing specific fault models. The paper describes the implementation of FireNN, the platform based on the proposed approach. Experimental analyses are performed using fault injection on AlexNet. They are carried out on the FireNN platform, and the results are compared with the outcome of traditional software-level evaluations. The results are discussed in light of the hardware-level insight gained with FireNN.
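
    For context on the software-level evaluations used as the baseline, one common fault model (assumed here, not stated in the abstract) is a single bit flip in a stored 32-bit floating-point weight. The helper `flip_bit` below is a hypothetical name and is not part of FireNN.

```python
import struct

def flip_bit(value, bit):
    """Return `value` (as float32) with one bit of its IEEE-754 encoding flipped."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit  # bit 31 is the sign, bits 30..23 the exponent
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw))
    return faulty

print(flip_bit(1.0, 31))  # sign-bit flip
print(flip_bit(1.0, 30))  # flip of the most significant exponent bit
```

    A flip in a high exponent bit changes a weight by many orders of magnitude, while a flip in a low mantissa bit is nearly invisible, which is one reason injection results depend heavily on the fault model.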

    Toward Fault-Tolerant Applications on Reconfigurable Systems-on-Chip

    The abstract is in the attachment.

    Exploiting Inherent Program Redundancy for Fault Tolerance

    Technology scaling has led to growing concerns about reliability in microprocessors. Current fault tolerance studies rely on explicitly redundant execution for fault detection or recovery, which usually carries a high cost in performance, power, or hardware. In our study, we find that exploiting a program's inherent redundancy offers a better trade-off between reliability, performance, and hardware cost. This work proposes two approaches to enhance program reliability. The first approach investigates the additional fault resilience available at the application level. We explore a definition of program correctness that views correctness from the application's standpoint rather than the architecture's. Under application-level correctness, multiple numerical outputs can be deemed correct as long as they are acceptable to users, so faults that cause a program to produce such outputs can also be tolerated. We find that programs which produce inexact or approximate outputs can be very resilient at the application level. We call such programs soft computations and find that they are common in multimedia and artificial intelligence (AI) workloads. Programs that compute only exact numerical outputs offer less error resilience at the application level. However, all programs we studied exhibit some enhanced fault resilience at the application level, including those traditionally considered exact computations, e.g., SPECInt CPU2000. We conduct fault injection experiments and evaluate the additional fault tolerance at the application level compared with the traditional architectural level. We also exploit the relaxed numerical-integrity requirements of application-level correctness to reduce checkpoint cost: our lightweight recovery mechanism checkpoints a minimal set of program state comprising the program counter, architectural register file, and stack, and our soft-checkpointing technique identifies computations that are resilient to errors and excludes their output state from the checkpoint. Both techniques incur much smaller runtime overhead than traditional checkpointing, yet can recover from all or a major part of program crashes in soft computations. The second approach studies value predictability as a means of reducing the fault rate. Value prediction is treated as additional execution, and its results are compared with the corresponding computed outputs. Any mismatch is treated as a symptom of a potential fault and triggers a restoration process. To reduce the misprediction rate caused by the limitations of the predictor itself, we characterize fault vulnerability at the instruction level and apply value prediction only to instructions that are highly susceptible to faults. We also vary the confidence-estimation threshold according to each instruction's vulnerability: instructions with high vulnerability are assigned a low confidence threshold, while instructions with low vulnerability are assigned a high one. Our experimental results show that selective prediction and an adaptive confidence threshold improve the balance between reliability and performance.
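
    The value-prediction check with a vulnerability-dependent confidence threshold can be sketched as follows. This is a minimal illustration built on a last-value predictor; the class name, the counter widths, and the two threshold values are assumptions, not details from the thesis.

```python
class LastValueChecker:
    """Last-value predictor used as a fault-symptom detector (illustrative only)."""

    def __init__(self, low_threshold=1, high_threshold=3, max_conf=3):
        self.table = {}  # pc -> (last_value, confidence counter)
        self.low_threshold = low_threshold    # for highly vulnerable instructions
        self.high_threshold = high_threshold  # for less vulnerable instructions
        self.max_conf = max_conf

    def check(self, pc, value, highly_vulnerable):
        """Return True if `value` mismatches a sufficiently confident prediction."""
        last, conf = self.table.get(pc, (None, 0))
        threshold = self.low_threshold if highly_vulnerable else self.high_threshold
        suspicious = last is not None and conf >= threshold and value != last
        # Saturating confidence update: grow on a match, reset on a mismatch.
        conf = min(conf + 1, self.max_conf) if value == last else 0
        self.table[pc] = (value, conf)
        return suspicious

checker = LastValueChecker()
for v in (7, 7, 7):
    checker.check(0x40, v, highly_vulnerable=True)
print(checker.check(0x40, 9, highly_vulnerable=True))
```

    With the low threshold, a mismatch is reported after only a couple of matching values, so vulnerable instructions are checked more aggressively at the cost of more false alarms.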

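    Techniques pour l'évaluation et l'amélioration du comportement des technologies émergentes face aux fautes aléatoires (Techniques for evaluating and improving the behavior of emerging technologies in the face of random faults)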

    The main objective of this thesis is to develop analysis and mitigation techniques to counter radiation-induced soft errors: external and internal disturbances produced by radioactive particles that affect the reliability and operational safety of complex microelectronic circuits. The thesis aims to provide industrial solutions and methodologies for terrestrial applications requiring ultimate reliability (telecommunications, medical devices, ...), complementing previous work on soft errors, which has traditionally targeted aerospace, nuclear, and military applications. The work decomposes the error sources inside current circuits to highlight the most important contributors. Single Event Effects in sequential logic cells are the current target of analysis and improvement efforts in both industry and academia. This thesis presents a state-aware analysis methodology that improves the accuracy of Soft Error Rate (SER) data for individual sequential instances based on the circuit and application. Furthermore, the intrinsic imbalance between the Single Event Upset (SEU) susceptibility of different flip-flop states is exploited to implement a low-cost SER improvement strategy. Single Event Transients (SETs) affecting combinational logic are considerably more difficult to model, simulate, and analyze than the closely related SEUs. The working environment may cause a myriad of distinctive transient pulses in various cell types that are used in widely different configurations. This thesis presents a practical approach to a potentially exhaustive SET evaluation flow in an industrial setting. The main steps of this process are: a) fully characterizing the standard cell library using a process- and library-aware SER tool, b) evaluating SET effects in the logic networks of the circuit using a variety of dynamic (simulation-based) and static (probabilistic) methods, and c) computing overall SET figures, taking into account the particularities of the circuit's implementation and its environment. Fault injection remains the primary method for analyzing the effects of soft errors. This document presents the results of a functional analysis of a complex CPU on three representative benchmarks. Accelerated simulation techniques (probabilistic calculations, clustering, parallel simulations) are proposed and evaluated in order to develop an industrial validation environment able to handle very complex circuits. The results enabled the development and evaluation of a hypothetical mitigation scenario that aims to significantly improve circuit reliability at the lowest cost. They show that the error rate, Silent Data Corruption (SDC), and Detectable Uncorrectable Errors (DUE) can be significantly reduced by hardening a small part of the circuit (selective mitigation). In addition to the main axis of research, some tangential topics were studied in collaboration with other teams. These included mitigating flip-flop soft errors by optimizing the Temporal De-Rating (TDR) through selective insertion of delay on the input or output of flip-flops, and biasing the circuit to favor less sensitive states. The methodologies, algorithms, and CAD tools proposed and validated as part of this work are intended for industrial use and have been included in a commercial CAD framework that offers a complete solution for assessing the reliability of complex electronic circuits and systems.

    New Design Techniques for Dynamic Reconfigurable Architectures

    The abstract is in the attachment.

    Reliability and Security Assessment of Modern Embedded Devices

    The abstract is in the attachment.

    Cross-layer Soft Error Analysis and Mitigation at Nanoscale Technologies

    This thesis addresses the challenge of soft error modeling and mitigation in nanoscale technology nodes and advances the state of the art by proposing novel modeling, analysis, and mitigation techniques. The proposed soft error sensitivity analysis platform accurately models both error generation and propagation, from technology-dependent device-level simulations all the way to workload-dependent application-level analysis.

    Methods and architectures based on modular redundancy for fault-tolerant combinational circuits

    This thesis investigates reliable architectures for logic circuits. By reliable, we mean architectures that mask faults and are thereby tolerant to them. Fault-tolerance solutions are based on redundancy, hence the overhead associated with them. Redundancy can be implemented in different ways: static or dynamic, spatial or temporal. We conduct this research while trying to minimize, as far as possible, the hardware overhead incurred by the fault-tolerance mechanism. The work focuses mainly on modular redundancy solutions, but some of the studies are much more general. We mainly consider the representative technique Triple Modular Redundancy (TMR) as the reliability improvement technique. A voter is a necessary element of this kind of fault-tolerant architecture. The reliability of the majority voter matters because of its use in both conventional fault-tolerant designs and novel nanoelectronic systems. The voter is therefore a bottleneck, since it directly determines the overall performance of a redundant fault-tolerant digital IP (such as a TMR configuration). The purpose of TMR is to increase the reliability of a digital IP, yet TMR can sometimes yield worse reliability than a simplex function module. We study the functional and signal reliability characteristics of a 3-input majority voter (majority voting in TMR), analyzing them by means of signal probability and Boolean difference. It is well known that output signal probabilities are much easier to obtain than output reliability. The results derived in this thesis establish the signal probability requirements for the inputs of a majority voter and thereby reveal the conditions that the TMR technique requires. This study shows the critical importance of the error characteristics of the majority voter as used in fault-tolerant designs. Because the majority voter in TMR cannot be assumed flawless, we also propose a simple, fault-tolerant 2-level majority voter structure for TMR. This alternative voter architecture is robust to a single fault and exceeds previous ones in terms of reliability.
    PARIS-Télécom ParisTech (751132302) / Sudoc, France
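
    The remark that TMR can be less reliable than a simplex module can be seen in the standard textbook model, which assumes independent module failures and a voter in series with the redundant modules. This is a simplified sketch, not the signal-probability analysis developed in the thesis, and the function name is hypothetical.

```python
def tmr_reliability(r_module, r_voter=1.0):
    """Textbook TMR reliability: the voter works and at least two of three modules work."""
    return r_voter * (3 * r_module**2 - 2 * r_module**3)

for r in (0.9, 0.4):
    print(r, tmr_reliability(r), tmr_reliability(r, r_voter=0.9))
```

    In this model TMR helps only when the module reliability exceeds 0.5, and an imperfect voter caps the reliability of the whole structure, which is why the voter's own fault tolerance matters.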