8 research outputs found

    BUILT IN SELF TEST FOR SAD MODULE IN MOTION ARRAY DETECTION

    Get PDF
    A novel method develops a built-in self-detection and correction (BISDC) architecture for motion estimation computing arrays(MECAs).Based on the error detection & correction concepts of biresidue codes, any single error in each processing element in an MECA can be effectively detected and corrected online using the proposed BISD and built-in selfcorrection circuits. Performance analysis and evaluation demonstrate that the proposed BISDC architecture performs well in error detection and correction with minor area i.e single error bit detection and correction . An advanced model has been proposed for multi bit detection using efficient adder implementation .a comparision is performed between efficient adder and processing element resultant

    Scalable Energy-efficient Microarchitectures with Computational Error Tolerance

    Get PDF
    Dennard scaling of conventional semiconductor technology has reached its limit resulting in issues pertaining to leakage current and threshold voltage. Energy-savings found at the transistor level by simply lowering supply voltage are no longer available for these devices (e.g., MOSFETs) and has reached the Landauer-Shannon limit. Recent proposals of minivolt switch technologies aim to extend the technology scaling roadmap by maintaining a high on/off ratio of drain current with a much lower supply voltage. However, high intermittent error probabilities in millivolt switches constraints their Vdd reduction for traditional architectures. Thus, there is an urgent need for scalable and energy-efficient micro-architectures with computational error-tolerance. This thesis systematically leverages the error detection and correction properties of the Redundant Residue Number System (RRNS) by varying the number of non-redundant (n) and redundant (r) components (residues), and selects and discusses trade-offs about configuration points from a two-dimensional (n, r)-RRNS design plane that meet certain capabilities of error detection and/or correction. Being able to efficiently handle resilience in this (n, r)-RRNS plane significantly improves reliability, allowing further Vdd reduction and energy savings. First, the necessary implementation details of RRNS cores are discussed. Second, scalable RRNS micro-architectures that simultaneously support both error-correction and checkpointing with restart capabilities for uncorrectable errors are proposed. Third, novel RRNS-based adaptive checkpointing&restart mechanisms are designed that automatically guarantee reliability while minimizing the energy-delay product (EDP). Finally, the RRNS design space is explored to find the optimal (n, r) configuration points. For similar reliability when compared to a conventional binary core (running at high Vdd) without computational error tolerance, the proposed RRNS scalable micro-architecture reduces EDP by 53% on average for memory-intensive workloads and by 67% on average for non-memory-intensive workloads. This thesis's second topic is to alleviate fault rate and power consumption issues of exascale computing. Faults in High-Performance Computing (HPC) have become an urgent challenge with estimated Mean Time Between Failures (MTBF) of exascale system projected as only several minutes with contemporary methodologies. Unfortunately, existing error-tolerance technologies in the context of HPC systems have serious deficiencies such as insufficient error-tolerance coverage, high power consumption, and difficult integration with existing workloads. Considering Department of Energy (DOE) guidelines that limit exascale power consumption to 20 MW, this thesis highlights the issue of energy usage and proposes a thread-level fault tolerance mechanism compatible with current state-of-the art exascale programming models while simultaneously meeting the requirements of full system error protection. Additionally, an efficient micro-architecture and corresponding mechanisms that can support thread level RRNS are discussed. Experimental results show that this strategy reduces energy consumption by 62.25% and the Energy-Delay-Product by 58.67% on average when compared with state-of-the-art black box resilience techniques.Ph.D

    Hardware Error Detection Using AN-Codes

    Get PDF
    Due to the continuously decreasing feature sizes and the increasing complexity of integrated circuits, commercial off-the-shelf (COTS) hardware is becoming less and less reliable. However, dedicated reliable hardware is expensive and usually slower than commodity hardware. Thus, economic pressure will most likely result in the usage of unreliable COTS hardware in safety-critical systems. The usage of unreliable, COTS hardware in safety-critical systems results in the need for software-implemented solutions for handling execution errors caused by this unreliable hardware. In this thesis, we provide techniques for detecting hardware errors that disturb the execution of a program. The detection provided facilitates handling of these errors, for example, by retry or graceful degradation. We realize the error detection by transforming unsafe programs that are not guaranteed to detect execution errors into safe programs that detect execution errors with a high probability. Therefore, we use arithmetic AN-, ANB-, ANBD-, and ANBDmem-codes. These codes detect errors that modify data during storage or transport and errors that disturb computations as well. Furthermore, the error detection provided is independent of the hardware used. We present the following novel encoding approaches: - Software Encoded Processing (SEP) that transforms an unsafe binary into a safe execution at runtime by applying an ANB-code, and - Compiler Encoded Processing (CEP) that applies encoding at compile time and provides different levels of safety by using different arithmetic codes. In contrast to existing encoding solutions, SEP and CEP allow to encode applications whose data and control flow is not completely predictable at compile time. For encoding, SEP and CEP use our set of encoded operations also presented in this thesis. To the best of our knowledge, we are the first ones that present the encoding of a complete RISC instruction set including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts and arithmetic operations. Our evaluations show that encoding with SEP and CEP significantly reduces the amount of erroneous output caused by hardware errors. Furthermore, our evaluations show that, in contrast to replication-based approaches for detecting errors, arithmetic encoding facilitates the detection of permanent hardware errors. This increased reliability does not come for free. However, unexpectedly the runtime costs for the different arithmetic codes supported by CEP compared to redundancy increase only linearly, while the gained safety increases exponentially

    Hardware Error Detection Using AN-Codes

    Get PDF
    Due to the continuously decreasing feature sizes and the increasing complexity of integrated circuits, commercial off-the-shelf (COTS) hardware is becoming less and less reliable. However, dedicated reliable hardware is expensive and usually slower than commodity hardware. Thus, economic pressure will most likely result in the usage of unreliable COTS hardware in safety-critical systems. The usage of unreliable, COTS hardware in safety-critical systems results in the need for software-implemented solutions for handling execution errors caused by this unreliable hardware. In this thesis, we provide techniques for detecting hardware errors that disturb the execution of a program. The detection provided facilitates handling of these errors, for example, by retry or graceful degradation. We realize the error detection by transforming unsafe programs that are not guaranteed to detect execution errors into safe programs that detect execution errors with a high probability. Therefore, we use arithmetic AN-, ANB-, ANBD-, and ANBDmem-codes. These codes detect errors that modify data during storage or transport and errors that disturb computations as well. Furthermore, the error detection provided is independent of the hardware used. We present the following novel encoding approaches: - Software Encoded Processing (SEP) that transforms an unsafe binary into a safe execution at runtime by applying an ANB-code, and - Compiler Encoded Processing (CEP) that applies encoding at compile time and provides different levels of safety by using different arithmetic codes. In contrast to existing encoding solutions, SEP and CEP allow to encode applications whose data and control flow is not completely predictable at compile time. For encoding, SEP and CEP use our set of encoded operations also presented in this thesis. To the best of our knowledge, we are the first ones that present the encoding of a complete RISC instruction set including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts and arithmetic operations. Our evaluations show that encoding with SEP and CEP significantly reduces the amount of erroneous output caused by hardware errors. Furthermore, our evaluations show that, in contrast to replication-based approaches for detecting errors, arithmetic encoding facilitates the detection of permanent hardware errors. This increased reliability does not come for free. However, unexpectedly the runtime costs for the different arithmetic codes supported by CEP compared to redundancy increase only linearly, while the gained safety increases exponentially

    Fault Tolerance for Real-Time Systems: Analysis and Optimization of Roll-back Recovery with Checkpointing

    Get PDF
    Increasing soft error rates in recent semiconductor technologies enforce the usage of fault tolerance. While fault tolerance enables correct operation in the presence of soft errors, it usually introduces a time overhead. The time overhead is particularly important for a group of computer systems referred to as real-time systems (RTSs) where correct operation is defined as producing the correct result of a computation while satisfying given time constraints (deadlines). Depending on the consequences when the deadlines are violated, RTSs are classified into soft and hard RTSs. While violating deadlines in soft RTSs usually results in some performance degradation, violating deadlines in hard RTSs results in catastrophic consequences. To determine if deadlines are met, RTSs are analyzed with respect to average execution time (AET) and worst case execution time (WCET), where AET is used for soft RTSs, and WCET is used for hard RTSs. When fault tolerance is employed in both soft and hard RTSs, the time overhead caused due to usage of fault tolerance may be the reason that deadlines in RTSs are violated. Therefore, there is a need to optimize the usage of fault tolerance in RTSs. To enable correct operation of RTSs in the presence of soft errors, in this thesis we consider a fault tolerance technique, Roll-back Recovery with Checkpointing (RRC), that efficiently copes with soft errors. The major drawback of RRC is that it introduces a time overhead which depends on the number of checkpoints that are used in RRC. Depending on how the checkpoints are distributed throughout the execution of the job, we consider the two checkpointing schemes: equidistant checkpointing, where the checkpoints are evenly distributed, and non-equidistant checkpointing, where the checkpoints are not evenly distributed. The goal of this thesis is to provide an optimization framework for RRC when used in RTSs while considering different optimization objectives which are important for RTSs. The purpose of such an optimization framework is to assist the designer of an RTS during the early design stage, when the designer needs to explore different fault tolerance techniques, and choose a particular fault tolerance technique that meets the specification requirements for the RTS that is to be implemented. By using the optimization framework presented in this thesis, the designer of an RTS can acquire knowledge if RRC is a suitable fault tolerance technique for the RTS which needs to be implemented. The proposed optimization framework includes the following optimization objectives. For soft RTSs, we consider optimization of RRC with respect to AET. For the case of equidistant checkpointing, the optimization framework provides the optimal number of checkpoints resulting in the minimal AET. For non-equidistant checkpointing, the optimization framework provides two adaptive techniques that estimate the probability of errors and adjust the checkpointing scheme (the number of checkpoints over time) with the goal to minimize the AET. While for soft RTSs analyses based on AET are sufficient, for hard RTSs it is more important to maximize the probability that deadlines are met. To evaluate to what extent a deadline is met, in this thesis we have used the statistical concept Level of Confidence (LoC). The LoC with respect to a given deadline defines the probability that a job (or a set of jobs) completes before the given deadline. As a metric, LoC is equally applicable for soft and hard RTSs. However, as an optimization objective LoC is used in hard RTSs. Therefore, for hard RTSs, we consider optimization of RRC with respect to LoC. For equidistant checkpointing, the optimization framework provides (1) for a single job, the optimal number of checkpoints resulting in the maximal LoC with respect to a given deadline, and (2) for a set of jobs running in a sequence and a global deadline, the optimization framework provides the number of checkpoints that should be assigned to each job such that the LoC with respect to the global deadline is maximized. For non-equidistant checkpointing, the optimization framework provides how a given number of checkpoints should be distributed such that the LoC with respect to a given deadline is maximized. Since the specification of an RTS may have a reliability requirement such that all deadlines need to be met with some probability, in this thesis we have introduced the concept Guaranteed Completion Time which refers to a completion time such that the probability that a job completes within this time is at least equal to a given reliability requirement. The optimization framework includes Guaranteed Completion Time as an optimization objective, and with respect to the Guaranteed Completion Time, the framework provides the optimal number of checkpoints, while assuming equidistant checkpointing, that results in the minimal Guaranteed Completion Time
    corecore