7 research outputs found

    Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms

    Get PDF
    Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations, which handles Both hard and soft errors. For hard errors, we propose methods based on diskless checkpointing and Algorithm Based Fault Tolerance (ABFT) to provide full matrix protection, including the left and right factor that are normally seen in dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum that is generated before the factorization constantly updates itself by the factorization operations to protect the right factor. In addition, without an available fault tolerant MPI supporting environment, we have also integrated the Checkpoint-on-Failure(CoF) mechanism into one-sided dense linear operations such as QR factorization to recover the running stack of the failed MPI process. Soft error is more challenging because of the silent data corruption, which leads to a large area of erroneous data due to error propagation. Full matrix protection is developed where the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating point weighted checksum scheme and soft error modeling technique. To allow practical use on large scale system, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead. Experiment results on large scale cluster system and multicore+GPGPU hybrid system have confirmed that our hard and soft error fault tolerance algorithms exhibit the expected error correcting capability, low space and performance overhead and compatibility with double precision floating point operation

    SYNTHESIS OF SOFT ERROR TOLERANT COMBINATIONAL CIRCUITS

    Get PDF

    SYNTHESIS OF SOFT ERROR TOLERANT COMBINATIONAL CIRCUITS

    Get PDF

    An advanced Framework for efficient IC optimization based on analytical models engine

    Get PDF
    En base als reptes sorgits a conseqüència de l'escalat de la tecnologia, la present tesis desenvolupa i analitza un conjunt d'eines orientades a avaluar la sensibilitat a la propagació d'esdeveniments SET en circuits microelectrònics. S'han proposant varies mètriques de propagació de SETs considerant l'impacto dels emmascaraments lògic, elèctric i combinat lògic-elèctric. Aquestes mètriques proporcionen una via d'anàlisi per quantificar tant les regions més susceptibles a propagar SETs com les sortides més susceptibles de rebre'ls. S'ha desenvolupat un conjunt d'algorismes de cerca de camins sensibilitzables altament adaptables a múltiples aplicacions, un sistema lògic especific i diverses tècniques de simplificació de circuits. S'ha demostrat que el retard d'un camí donat depèn dels vectors de sensibilització aplicats a les portes que formen part del mateix, essent aquesta variació de retard comparable a la atribuïble a les variacions paramètriques del proces.En base a los desafíos surgidos a consecuencia del escalado de la tecnología, la presente tesis desarrolla y analiza un conjunto de herramientas orientadas a evaluar la sensibilidad a la propagación de eventos SET en circuitos microelectrónicos. Se han propuesto varias métricas de propagación de SETs considerando el impacto de los enmascaramientos lógico, eléctrico y combinado lógico-eléctrico. Estas métricas proporcionan una vía de análisis para cuantificar tanto las regiones más susceptibles a propagar eventos SET como las salidas más susceptibles a recibirlos. Ha sido desarrollado un conjunto de algoritmos de búsqueda de caminos sensibilizables altamente adaptables a múltiples aplicaciones, un sistema lógico especifico y diversas técnicas de simplificación de circuitos. Se ha demostrado que el retardo de un camino dado depende de los vectores de sensibilización aplicados a las puertas que forman parte del mismo, siendo esta variación de retardo comparable a la atribuible a las variaciones paramétricas del proceso.Based on the challenges arising as a result of technology scaling, this thesis develops and evaluates a complete framework for SET propagation sensitivity. The framework comprises a number of processing tools capable of handling circuits with high complexity in an efficient way. Various SET propagation metrics have been proposed considering the impact of logic, electric and combined logic-electric masking. Such metrics provide a valuable vehicle to grade either in-circuit regions being more susceptible of propagating SETs toward the circuit outputs or circuit outputs more susceptible to produce SET. A quite efficient and customizable true path finding algorithm with a specific logic system has been constructed and its efficacy demonstrated on large benchmark circuits. It has been shown that the delay of a path depends on the sensitization vectors applied to the gates within the path. In some cases, this variation is comparable to the one caused by process parameters variation

    Soft-Error Resilience Framework For Reliable and Energy-Efficient CMOS Logic and Spintronic Memory Architectures

    Get PDF
    The revolution in chip manufacturing processes spanning five decades has proliferated high performance and energy-efficient nano-electronic devices across all aspects of daily life. In recent years, CMOS technology scaling has realized billions of transistors within large-scale VLSI chips to elevate performance. However, these advancements have also continually augmented the impact of Single-Event Transient (SET) and Single-Event Upset (SEU) occurrences which precipitate a range of Soft-Error (SE) dependability issues. Consequently, soft-error mitigation techniques have become essential to improve systems\u27 reliability. Herein, first, we proposed optimized soft-error resilience designs to improve robustness of sub-micron computing systems. The proposed approaches were developed to deliver energy-efficiency and tolerate double/multiple errors simultaneously while incurring acceptable speed performance degradation compared to the prior work. Secondly, the impact of Process Variation (PV) at the Near-Threshold Voltage (NTV) region on redundancy-based SE-mitigation approaches for High-Performance Computing (HPC) systems was investigated to highlight the approach that can realize favorable attributes, such as reduced critical datapath delay variation and low speed degradation. Finally, recently, spin-based devices have been widely used to design Non-Volatile (NV) elements such as NV latches and flip-flops, which can be leveraged in normally-off computing architectures for Internet-of-Things (IoT) and energy-harvesting-powered applications. Thus, in the last portion of this dissertation, we design and evaluate for soft-error resilience NV-latching circuits that can achieve intriguing features, such as low energy consumption, high computing performance, and superior soft errors tolerance, i.e., concurrently able to tolerate Multiple Node Upset (MNU), to potentially become a mainstream solution for the aerospace and avionic nanoelectronics. Together, these objectives cooperate to increase energy-efficiency and soft errors mitigation resiliency of larger-scale emerging NV latching circuits within iso-energy constraints. In summary, addressing these reliability concerns is paramount to successful deployment of future reliable and energy-efficient CMOS logic and spintronic memory architectures with deeply-scaled devices operating at low-voltages

    Hardware Error Detection Using AN-Codes

    Get PDF
    Due to the continuously decreasing feature sizes and the increasing complexity of integrated circuits, commercial off-the-shelf (COTS) hardware is becoming less and less reliable. However, dedicated reliable hardware is expensive and usually slower than commodity hardware. Thus, economic pressure will most likely result in the usage of unreliable COTS hardware in safety-critical systems. The usage of unreliable, COTS hardware in safety-critical systems results in the need for software-implemented solutions for handling execution errors caused by this unreliable hardware. In this thesis, we provide techniques for detecting hardware errors that disturb the execution of a program. The detection provided facilitates handling of these errors, for example, by retry or graceful degradation. We realize the error detection by transforming unsafe programs that are not guaranteed to detect execution errors into safe programs that detect execution errors with a high probability. Therefore, we use arithmetic AN-, ANB-, ANBD-, and ANBDmem-codes. These codes detect errors that modify data during storage or transport and errors that disturb computations as well. Furthermore, the error detection provided is independent of the hardware used. We present the following novel encoding approaches: - Software Encoded Processing (SEP) that transforms an unsafe binary into a safe execution at runtime by applying an ANB-code, and - Compiler Encoded Processing (CEP) that applies encoding at compile time and provides different levels of safety by using different arithmetic codes. In contrast to existing encoding solutions, SEP and CEP allow to encode applications whose data and control flow is not completely predictable at compile time. For encoding, SEP and CEP use our set of encoded operations also presented in this thesis. To the best of our knowledge, we are the first ones that present the encoding of a complete RISC instruction set including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts and arithmetic operations. Our evaluations show that encoding with SEP and CEP significantly reduces the amount of erroneous output caused by hardware errors. Furthermore, our evaluations show that, in contrast to replication-based approaches for detecting errors, arithmetic encoding facilitates the detection of permanent hardware errors. This increased reliability does not come for free. However, unexpectedly the runtime costs for the different arithmetic codes supported by CEP compared to redundancy increase only linearly, while the gained safety increases exponentially
    corecore