28 research outputs found
Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices
This research addresses the design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault and defect tolerant (FDT) computer. Trade-offs between different techniques are studied, and yield and hardware cost models are developed. Fault and defect tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield with device failure probabilities of 3 × 10⁻⁶, three orders of magnitude better than non-fault-tolerant caches of the same size. The entire processor achieves 70% yield with device failure probabilities exceeding 10⁻⁶. The required hardware redundancy is approximately 15 times that of a non-fault-tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of model development, an improved model is derived for NAND multiplexing. It is the first accurate model for small and medium amounts of redundancy. Previous models are extended to account for dependence between the inputs and produce more accurate results.
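The yield-versus-redundancy trade-off described above can be illustrated with a toy majority-voting model (a sketch for intuition only, not the paper's NAND-multiplexing model; the failure probabilities and unit counts below are made up):

```python
from math import comb

def unit_survival(p_fail: float, r: int) -> float:
    """Probability that a majority-voted group of r copies works:
    more than half of the copies must be fault-free."""
    p_ok = 1.0 - p_fail
    need = r // 2 + 1
    return sum(comb(r, k) * p_ok**k * p_fail**(r - k)
               for k in range(need, r + 1))

def chip_yield(p_fail: float, r: int, n_units: int) -> float:
    """Yield of a chip built from n_units independent redundant groups."""
    return unit_survival(p_fail, r) ** n_units

# With no redundancy, a 1000-unit chip at p_fail = 1e-3 yields poorly;
# triplicating each unit recovers most of the yield at 3x the hardware.
baseline = chip_yield(1e-3, 1, 1000)
tmr      = chip_yield(1e-3, 3, 1000)
```

The same structure, with a far more careful survival model, underlies the trade-off between redundancy factor and yield reported in the abstract.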
Choose-Your-Own Adventure: A Lightweight, High-Performance Approach To Defect And Variation Mitigation In Reconfigurable Logic
For field-programmable gate arrays (FPGAs), fine-grained pre-computed alternative configurations, combined with simple test-based selection, provide limited per-chip specialization to counter the yield loss, increased delay, and increased energy costs that come from fabrication defects and variation. This lightweight approach achieves much of the benefit of knowledge-based full specialization while reducing the computational, testing, and load-time costs that obstruct the knowledge-based approach to practical, palatable levels. In practice this may more than double the power-limited computational capabilities of dies fabricated in 22nm technologies.
Contributions of this work:
• Choose-Your-Own-Adventure (CYA), a novel, lightweight, scalable methodology for defect and variation mitigation
• Implementation of CYA, including preparatory components (generation of diverse alternative paths) and FPGA load-time components
• Detailed performance characterization of CYA
– Comparison to conventional loading and dynamic frequency and voltage scaling (DFVS)
– Limit studies to characterize the quality of the CYA implementation and identify potential areas for further optimization
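The test-based selection at the heart of the approach can be sketched as follows (a hypothetical illustration: it assumes each region ships with a few pre-computed alternatives annotated with a `delay_ns` estimate and a per-chip `passes_test` predicate; none of these names come from the paper):

```python
def pick_configuration(alternatives, passes_test):
    """Return the best (lowest-delay) alternative configuration that
    passes this chip's post-load test, or None if none works."""
    working = [c for c in alternatives if passes_test(c)]
    return min(working, key=lambda c: c["delay_ns"]) if working else None

# Hypothetical per-region alternatives: each uses a set of routing tracks.
alts = [
    {"tracks": {1, 2}, "delay_ns": 3.0},
    {"tracks": {3, 4}, "delay_ns": 3.4},
    {"tracks": {5, 6}, "delay_ns": 3.9},
]
defective = {2}  # tracks this particular die cannot use
best = pick_configuration(alts, lambda c: not (c["tracks"] & defective))
```

Here the nominally fastest alternative uses a defective track, so selection falls through to the next-fastest working one; the per-chip cost is only a few tests per region rather than a full place-and-route.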
The Case for Reconfigurable Components with Logic Scrubbing: Regular Hygiene Keeps Logic FIT (low)
As we approach atomic-scale logic, we must accommodate an increased rate of manufacturing defects, transient upsets, and in-field persistent failures. High defect rates demand reconfiguration to avoid defective components, and transient upsets demand online error detection to catch failures. Combining these techniques, we can detect in-field persistent failures when they occur and reconfigure around them. However, since failures may be logically masked for long periods of time, persistent failures may accumulate silently; this integration of errors over time means the effective failure rate for persistent errors can exceed transient upset rates. As a result, logic scrubbing is necessary to prevent the silent accumulation of an undetectable number of persistent errors. We provide simple analysis to illustrate quantitatively how this phenomenon can be a concern.
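A first-order sketch of the accumulation argument (a simplified assumption of this illustration, not the paper's exact analysis: masked persistent failures arrive at a constant rate and are only cleared when a scrub exposes them):

```python
def expected_latent_failures(rate_per_hour, hours, scrub_interval=None):
    """Expected number of silently accumulated persistent failures.
    Without scrubbing, masked failures build up over the whole mission;
    with periodic scrubbing they can accumulate only within one interval
    (half an interval on average at any point in time)."""
    if scrub_interval is None:
        return rate_per_hour * hours
    return rate_per_hour * scrub_interval / 2

# Even a modest persistent-failure rate integrates to a whole expected
# latent failure over a long deployment, while daily scrubbing keeps the
# expected latent count three orders of magnitude lower.
no_scrub = expected_latent_failures(1e-4, 10_000.0)        # 10k-hour mission
daily    = expected_latent_failures(1e-4, 10_000.0, 24.0)  # scrub every 24h
```

This is the sense in which the *effective* persistent-error rate, integrated over time, can exceed the transient upset rate unless scrubbing bounds the integration window.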
Interconnect yield analysis and fault tolerance for field programmable gate arrays
Amorphous Computing
Amorphous computing is the development of organizational principles and programming languages for obtaining coherent behaviors from the cooperation of myriads of unreliable parts that are interconnected in unknown, irregular, and time-varying ways. The impetus for amorphous computing comes from developments in microfabrication and fundamental biology, each of which is the basis of a kernel technology that makes it possible to build or grow huge numbers of almost-identical information-processing units at almost no cost. This paper sets out a research agenda for realizing the potential of amorphous computing and surveys some initial progress, both in programming and in fabrication. We describe some approaches to programming amorphous systems, which are inspired by metaphors from biology and physics. We also present the basic ideas of cellular computing, an approach to constructing digital-logic circuits within living cells by representing logic levels by concentrations of DNA-binding proteins.
Performance analysis of fault-tolerant nanoelectronic memories
Performance growth in microelectronics, as described by Moore’s law, is steadily approaching its limits. Nanoscale technologies are increasingly being explored as a practical solution for sustaining, and possibly surpassing, current performance trends of microelectronics. This work presents an in-depth analysis of the performance impact of incorporating reliability schemes into the architecture of a crossbar molecular switch nanomemory and demultiplexer. Nanoelectronics are currently in their early stages, so fabrication and design methodologies are still being studied and developed. The building blocks of nanotechnology are fabricated using bottom-up processes, which leave them highly susceptible to defects. Hence, it is very important that defect- and fault-tolerant schemes be incorporated into the design of nanotechnology-related devices.
In this dissertation, we focus on the study of a novel and promising class of computer chip memories called crossbar molecular switch memories and their demultiplexer addressing units. A major part of this work was the design of a defect and fault tolerance scheme we called the Multi-Switch Junction (MSJ) scheme. The MSJ scheme takes advantage of the regular array geometry of the crossbar nanomemory to create multiple switches in the fabric of the crossbar nanomemory for the storage of a single bit.
Implementing defect and fault tolerant schemes comes at a performance cost to the crossbar nanomemory; the challenge becomes achieving a balance between device reliability and performance. We have studied the reliability-induced performance penalties as they relate to the time (delay) it takes to access a bit and the amount of power dissipated in the process. MSJ was also compared to the banking and error-correction-coding fault tolerant schemes, and studies were conducted to ascertain the potential benefits of integrating our MSJ scheme with the banking scheme. A trade-off analysis between access time delay, power dissipation, and reliability is presented in this work.
Results show the MSJ scheme increases the reliability of the crossbar nanomemory and demultiplexer. Simulation results also indicate that MSJ works very well for smaller nanomemory array sizes, with reliabilities of 100% for molecular switch failure rates of 10% or less.
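The intuition behind storing one bit on multiple junctions can be sketched with a simplified OR-redundancy model (an illustration only; the dissertation's actual MSJ analysis also accounts for access delay and power, which this sketch ignores):

```python
def bit_reliability(p_switch_fail: float, junctions: int) -> float:
    """A bit stored on several parallel junctions survives as long as
    at least one junction is good (simplified OR-redundancy model)."""
    return 1.0 - p_switch_fail ** junctions

def array_reliability(p_switch_fail: float, junctions: int, bits: int) -> float:
    """Probability that every bit of a small array is readable,
    assuming independent junction failures."""
    return bit_reliability(p_switch_fail, junctions) ** bits

# At a 10% per-switch failure rate, a single-junction 64-bit array is
# almost certainly broken, while four junctions per bit recover it.
single = array_reliability(0.10, 1, 64)
msj4   = array_reliability(0.10, 4, 64)
```

The sharp improvement for small arrays mirrors the abstract's observation that MSJ performs best at smaller array sizes, since the per-bit reliability gain is exponentiated by the bit count.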
A survey of fault-tolerance algorithms for reconfigurable nano-crossbar arrays
ACM Comput. Surv. Volume 50, issue 6 (November 2017). Nano-crossbar arrays have emerged as a promising and viable technology for improving the computing performance of electronic circuits beyond the limits of current CMOS. Arrays offer both structural efficiency through reconfiguration and the prospective capability of integration with different technologies. However, certain problems need to be addressed, the most important being the prevailing occurrence of faults. Considering fault rate projections as high as 20%, much higher than those of CMOS, it is fair to expect sophisticated fault-tolerance methods. The focus of this survey article is the assessment and evaluation of these methods and of related algorithms applied in logic mapping and configuration processes. As a start, we concisely explain reconfigurable nano-crossbar arrays with their fault characteristics and models. Following that, we demonstrate configuration techniques for the arrays in the presence of permanent faults and elaborate on the two main fault-tolerance methodologies, namely defect-unaware and defect-aware approaches, with a short review of their advantages and disadvantages. For both methodologies, we present detailed experimental results of related algorithms regarding their strengths and weaknesses, with a comprehensive yield, success rate, and runtime analysis. Next, we overview fault-tolerance approaches for transient faults. As a conclusion, we review the proposed algorithms with future directions and upcoming challenges. This work is supported by the EU-H2020-RISE project NANOxCOMP no 691178 and the TUBITAK-Career project no 113E760.
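Defect-aware mapping, one of the two methodologies surveyed, can be illustrated with a toy brute-force mapper (a hypothetical example; the surveyed algorithms use matching- or search-based formulations rather than enumerating permutations, which is only feasible for tiny arrays):

```python
from itertools import permutations

def defect_aware_map(required, defect_free):
    """Try to assign each logic function i (needing the column set
    required[i]) to a physical crossbar row whose defect-free junctions
    cover those columns. Returns a tuple perm with perm[i] = row for
    function i, or None if no assignment exists."""
    n = len(required)
    for perm in permutations(range(n)):
        if all(required[i] <= defect_free[perm[i]] for i in range(n)):
            return perm
    return None

# Hypothetical 2x3 crossbar: physical row 0 has a defective junction in
# column 1 (usable columns {0, 2}); row 1 is fully functional.
required    = [{0, 1}, {2}]           # columns each logic function needs
defect_free = [{0, 2}, {0, 1, 2}]     # usable columns per physical row
mapping = defect_aware_map(required, defect_free)
```

A defect-unaware approach would instead demand that every junction in the used region be good; the defect-aware mapper tolerates the faulty junction by routing around it, which is why the survey reports higher success rates for defect-aware algorithms at high fault rates, at the cost of per-chip mapping time.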