590 research outputs found

    Redundant residue number system code for fault-tolerant hybrid memories

    Hybrid memories are envisioned as one of the alternatives to existing semiconductor memories. Although they offer enormous data storage capacity, low power consumption, and reduced fabrication complexity (at least for the memory cell array), such memories are subject to a high rate of intermittent and transient faults, which leads to reliability issues. This article examines the use of the Conventional Redundant Residue Number System (C-RRNS) error correction code, which has been extensively used in digital signal processing and communication, to detect and correct intermittent and transient cluster faults in hybrid memories. It introduces a modified version of C-RRNS, referred to as 6M-RRNS, to achieve these aims at lower area overhead and performance penalty. The experimental results show that 6M-RRNS achieves a competitive error correction capability, provides larger data storage capacity, and offers higher decoding performance compared to C-RRNS and Reed-Solomon (RS) codes. For instance, for 64-bit hybrid memories at a 10% fault rate, 6M-RRNS has 98.95% error correction capability, which is 0.35% better than RS and 0.40% less than C-RRNS. Moreover, when considering a 1 Tbit memory, 6M-RRNS offers 4.35% more data storage capacity than RS and 11.41% more than C-RRNS. Additionally, it decodes up to 5.25 times faster than C-RRNS.
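
As a rough illustration of how a redundant residue code can detect and correct a faulty residue, the sketch below implements a generic RRNS with two redundant moduli. It is not the article's 6M-RRNS or C-RRNS construction: the moduli, the function names, and the drop-one-residue decoding strategy are assumptions chosen only to keep the example small.

```python
from math import prod

# Hypothetical moduli (not from the article): pairwise coprime, with the
# redundant moduli larger than every non-redundant one.
INFO_MODULI = (7, 9, 11)       # non-redundant moduli
REDUNDANT_MODULI = (13, 16)    # two redundant moduli: one residue error is correctable
MODULI = INFO_MODULI + REDUNDANT_MODULI
LEGIT_RANGE = prod(INFO_MODULI)  # valid data values are 0 .. 692


def encode(value):
    """Represent value by its residues modulo every modulus."""
    if not 0 <= value < LEGIT_RANGE:
        raise ValueError("value outside the legitimate range")
    return [value % m for m in MODULI]


def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction for pairwise coprime moduli."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % total


def decode(residues):
    """Return (value, faulty_index); faulty_index is None for a clean word."""
    x = crt(residues, MODULI)
    if x < LEGIT_RANGE:
        return x, None
    # A reconstruction outside the legitimate range signals an error.
    # Drop one residue at a time and keep the reconstruction that is legitimate.
    for skip in range(len(MODULI)):
        kept = [i for i in range(len(MODULI)) if i != skip]
        candidate = crt([residues[i] for i in kept], [MODULI[i] for i in kept])
        if candidate < LEGIT_RANGE:
            return candidate, skip
    raise ValueError("more residue errors than this code can correct")
```

For example, corrupting any single residue of `encode(500)` and passing the result to `decode` recovers 500 together with the index of the corrupted residue. Cluster faults that span several residues would need more redundant moduli than this sketch uses.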

    On Fault Tolerance Methods for Networks-on-Chip

    Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, a relative increase in variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for network-on-chip (NoC), which is a design paradigm targeted at very large systems-on-chip. In a NoC, resources such as processors and memories are connected to a communication network, comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains their classification into transient, intermittent, and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method, especially against transient faults, at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions is shown to provide good tolerance against intermittent and permanent errors, and their combination with error control coding is illustrated.
At the network layer, positioned above the data link layer, fault tolerance can be achieved with the design of fault-tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels.

    Calcul sur architecture non fiable

    In theory, electronic circuits designed with worst-case methodologies can be made error-free, at a high implementation cost. In practice, circuits remain susceptible to transient faults caused by effects such as radiation and temperature sensitivity. In contrast, an error-resilient design relieves the manufacturing process of the variability issue and thereby saves material cost. Because variability and transient upsets are worsening as new fabrication processes emerge and circuit dimensions shrink, robust design is becoming essential. This thesis addresses the problem of designing on unreliable circuits. The main contributions are fourfold. First, a fault-tolerant method based on fast error correction and low-cost redundancy is presented. Second, we introduce judicious two-dimensional criteria to estimate the reliability and the hardware efficiency of a circuit. Third, a general-purpose model offers low-redundancy error resilience for contemporary logic systems as well as future nanoelectronic architectures. Finally, a decoder that tolerates internal transient faults is designed.

    Virtual Runtime Application Partitions for Resource Management in Massively Parallel Architectures

    This thesis presents a novel design paradigm, called Virtual Runtime Application Partitions (VRAP), to judiciously utilize on-chip resources. As the dark silicon era approaches, where power considerations will allow only a fraction of the chip to be powered on, judicious resource management will become a key consideration in future designs. Most works on resource management treat only the physical components (i.e. computation, communication, and memory blocks) as resources and manipulate the component-to-application mapping to optimize various parameters (e.g. energy efficiency). To further enhance the optimization potential, we propose to manipulate abstract resources (i.e. the voltage/frequency operating point, the fault-tolerance strength, the degree of parallelism, and the configuration architecture) in addition to the physical resources. The proposed framework (i.e. VRAP) encapsulates methods, algorithms, and hardware blocks to provide each application with the abstract resources tailored to its needs. To test the efficacy of this concept, we have developed three distinct self-adaptive environments: (i) Private Operating Environment (POE), (ii) Private Reliability Environment (PRE), and (iii) Private Configuration Environment (PCE), which collectively ensure that each application meets its deadlines using minimal platform resources. In this work, several novel architectural enhancements, algorithms, and policies are presented to realize the virtual runtime application partitions efficiently. Considering future design trends, we have chosen Coarse Grained Reconfigurable Architectures (CGRAs) and Networks on Chip (NoCs) to test the feasibility of our approach. Specifically, we have chosen the Dynamically Reconfigurable Resource Array (DRRA) and McNoC as the representative CGRA and NoC platforms. The proposed techniques are compared and evaluated using a variety of quantitative experiments.
Synthesis and simulation results demonstrate that VRAP significantly enhances energy and power efficiency compared to the state of the art.

    Scalable Energy-efficient Microarchitectures with Computational Error Tolerance

    Dennard scaling of conventional semiconductor technology has reached its limit, resulting in issues pertaining to leakage current and threshold voltage. Energy savings found at the transistor level by simply lowering supply voltage are no longer available for these devices (e.g., MOSFETs), which have reached the Landauer-Shannon limit. Recent proposals of millivolt switch technologies aim to extend the technology scaling roadmap by maintaining a high on/off ratio of drain current with a much lower supply voltage. However, high intermittent error probabilities in millivolt switches constrain their Vdd reduction for traditional architectures. Thus, there is an urgent need for scalable and energy-efficient micro-architectures with computational error tolerance. This thesis systematically leverages the error detection and correction properties of the Redundant Residue Number System (RRNS) by varying the number of non-redundant (n) and redundant (r) components (residues), and it selects configuration points from a two-dimensional (n, r)-RRNS design plane that meet certain capabilities of error detection and/or correction and discusses their trade-offs. Being able to efficiently handle resilience in this (n, r)-RRNS plane significantly improves reliability, allowing further Vdd reduction and energy savings. First, the necessary implementation details of RRNS cores are discussed. Second, scalable RRNS micro-architectures that simultaneously support both error correction and checkpointing with restart capabilities for uncorrectable errors are proposed. Third, novel RRNS-based adaptive checkpointing and restart mechanisms are designed that automatically guarantee reliability while minimizing the energy-delay product (EDP). Finally, the RRNS design space is explored to find the optimal (n, r) configuration points.
For similar reliability when compared to a conventional binary core (running at high Vdd) without computational error tolerance, the proposed RRNS scalable micro-architecture reduces EDP by 53% on average for memory-intensive workloads and by 67% on average for non-memory-intensive workloads. This thesis's second topic is to alleviate the fault rate and power consumption issues of exascale computing. Faults in High-Performance Computing (HPC) have become an urgent challenge, with the estimated Mean Time Between Failures (MTBF) of exascale systems projected as only several minutes with contemporary methodologies. Unfortunately, existing error-tolerance technologies in the context of HPC systems have serious deficiencies such as insufficient error-tolerance coverage, high power consumption, and difficult integration with existing workloads. Considering Department of Energy (DOE) guidelines that limit exascale power consumption to 20 MW, this thesis highlights the issue of energy usage and proposes a thread-level fault tolerance mechanism compatible with current state-of-the-art exascale programming models while simultaneously meeting the requirements of full system error protection. Additionally, an efficient micro-architecture and corresponding mechanisms that can support thread-level RRNS are discussed. Experimental results show that this strategy reduces energy consumption by 62.25% and the energy-delay product by 58.67% on average when compared with state-of-the-art black-box resilience techniques.
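
To make the (n, r) design plane more concrete, the sketch below enumerates configuration points using the standard RRNS capability bounds: with r redundant residues, up to r residue errors can be detected, or up to floor(r/2) corrected, assuming pairwise coprime moduli with every redundant modulus larger than every non-redundant one. The function names and the `residue_overhead` metric (r / n) are hypothetical illustrations; the thesis's own selection criteria, such as EDP, are not modeled here.

```python
def rrns_capability(n, r):
    """Capability bounds of an (n, r)-RRNS, counted in residue errors.

    Assumes pairwise coprime moduli with every redundant modulus larger
    than every non-redundant one.
    """
    if n < 1 or r < 0:
        raise ValueError("need n >= 1 and r >= 0")
    return {
        "detect": r,                # errors detectable when no correction is attempted
        "correct": r // 2,          # errors correctable
        "residue_overhead": r / n,  # hypothetical cost metric: extra residues per data residue
    }


def candidate_points(max_n, max_r, min_correct):
    """List (n, r, overhead) points that correct at least min_correct errors,
    cheapest overhead first."""
    points = []
    for n in range(1, max_n + 1):
        for r in range(max_r + 1):
            cap = rrns_capability(n, r)
            if cap["correct"] >= min_correct:
                points.append((n, r, cap["residue_overhead"]))
    return sorted(points, key=lambda p: p[2])
```

Calling `candidate_points(4, 4, 1)` ranks (4, 2) first, since it corrects one residue error with the smallest number of redundant residues per data residue in that range.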

    Low-cost error detection through high-level synthesis

    System-on-chip design is becoming increasingly complex as technology scaling enables more and more functionality on a chip. This scaling and complexity have resulted in a variety of reliability and validation challenges including logic bugs, hot spots, wear-out, and soft errors. To make matters worse, as we reach the limits of Dennard scaling, efforts to improve system performance and energy efficiency have resulted in the integration of a wide variety of complex hardware accelerators in SoCs. Thus the challenge is to design complex, custom hardware that is efficient, but also correct and reliable. High-level synthesis shows promise to address the problem of complex hardware design by providing a bridge from the high-productivity software domain to the hardware design process. Much research has been done on high-level synthesis efficiency optimizations. This thesis shows that high-level synthesis also has the power to address validation and reliability challenges through two solutions. One solution for circuit reliability is modulo-3 shadow datapaths: performing lightweight shadow computations in modulo-3 space for each main computation. We leverage the binding and scheduling flexibility of high-level synthesis to detect control errors through diverse binding and to minimize area cost through intelligent checkpoint scheduling and modulo-3 reducer sharing. We introduce logic and dataflow optimizations to further reduce cost. We evaluated our technique with 12 high-level synthesis benchmarks from the arithmetic-oriented PolyBench benchmark suite using FPGA-emulated netlist-level error injection. We observe coverages of 99.1% for stuck-at faults, 99.5% for soft errors, and 99.6% for timing errors with a 25.7% area cost and negligible performance impact.
Leveraging a mean error detection latency of 12.75 cycles (4150x faster than an end-result check) for soft errors, we also explore a rollback recovery method with an additional area cost of 28.0%, observing a 175x increase in reliability against soft errors. Another solution, for rapid post-silicon validation of accelerator designs, is Hybrid Quick Error Detection (H-QED): inserting signature generation logic in a hardware design to create a heavily compressed signature stream that captures the internal behavior of the design at a fine temporal and spatial granularity, for comparison with a reference set of signatures generated by high-level simulation to detect bugs. Using H-QED, we demonstrate an improvement in error detection latency (time elapsed from when a bug is activated to when it manifests as an observable failure) of two orders of magnitude and a threefold improvement in bug coverage compared to traditional post-silicon validation techniques. H-QED also uncovered previously unknown bugs in the CHStone benchmark suite, which is widely used by the HLS community. H-QED incurs less than 10% area overhead for the accelerator it validates with negligible performance impact, and we also introduce techniques to minimize any possible intrusiveness introduced by H-QED.
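
The modulo-3 shadow datapath idea mentioned in this abstract can be sketched in software as a residue check on a multiply-accumulate. This is only an assumed illustration of the arithmetic principle, not the thesis's hardware design: the function name and the `fault_mask` parameter used to inject bit flips are hypothetical.

```python
def mod3_checked_mac(a, b, c, fault_mask=0):
    """Compute a * b + c with a modulo-3 shadow check.

    fault_mask is a hypothetical test hook: its set bits are flipped in the
    main result to emulate a fault in the main datapath.
    Returns (main_result, check_passed).
    """
    main = a * b + c
    main ^= fault_mask  # emulate a fault in the main datapath only

    # Shadow datapath: the same operation carried out on residues modulo 3.
    shadow = ((a % 3) * (b % 3) + (c % 3)) % 3

    # Multiplication and addition commute with reduction modulo 3, so a
    # fault-free main result must reduce to the shadow value.
    return main, (main % 3) == shadow
```

A single flipped bit changes the main result by plus or minus a power of two, which is never a multiple of 3, so the check always catches it. Faults that change the result by a multiple of 3 escape this check, which is one reason reported coverage is below 100%.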

    Affordable techniques for dependable microprocessor design

    As high computing power is available at an affordable cost, we rely on microprocessor-based systems for a much greater variety of applications. This dependence indicates that a processor failure could have more diverse impacts on our daily lives. Therefore, dependability is becoming an increasingly important quality measure of microprocessors. Temporary hardware malfunctions caused by unstable environmental conditions can lead the processor to an incorrect state. This is referred to as a transient error or soft error. Studies have shown that soft errors are the major source of system failures. This dissertation characterizes the soft error behavior of microprocessors and presents new microarchitectural approaches that can realize high dependability with low overhead. Our fault injection studies using RISC processors have demonstrated that different functional blocks of the processor have distinct susceptibilities to soft errors. The error susceptibility information must be reflected in devising fault tolerance schemes for cost-sensitive applications. Considering the common use of on-chip caches in modern processors, we investigated area-efficient protection schemes for memory arrays. The idea of caching redundant information was exploited to optimize resource utilization for increased dependability. We also developed a mechanism to verify the integrity of data transfer from lower-level memories to the primary caches. The results of this study show that by exploiting bus idle cycles and information redundancy, an almost complete check of the initial memory data transfer is possible without incurring a performance penalty. For protecting the processor's control logic, which usually remains unprotected, we propose a low-cost reliability enhancement strategy.
We classified control logic signals into static and dynamic control depending on their changeability, and applied various techniques including commit-time checking, signature caching, component-level duplication, and control flow monitoring. Our schemes can achieve more than 99% coverage with a very small hardware addition. Finally, a virtual duplex architecture for superscalar processors is presented. In this system-level approach, the processor pipeline is backed up by a partially replicated pipeline. The replication-based checker minimizes the design and verification overheads. For a large-scale superscalar processor, the proposed architecture can bring a 61.4% reduction in die area while sustaining the maximum performance.

    Fault tolerant programmable digital attitude control electronics study

    An attitude control electronics mechanization study to develop a fault-tolerant autonomous concept for a three-axis system is reported. Programmable digital electronics are compared to general-purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation.

    Techniques for the realization of ultra-reliable spaceborne computer Final report

    Bibliography and new techniques for the use of error correction and redundancy to improve the reliability of spaceborne computers.
