Design of a fault tolerant airborne digital computer. Volume 1: Architecture
This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable and contractible. (4) The design is to be appropriate to post-1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition, SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor, are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.
Study of Single Event Transient Error Mitigation
Single Event Transient (SET) errors in ground-level electronic devices are a growing concern in the radiation hardening field. However, effective SET mitigation technologies that satisfy ground-level demands for being generic, flexible, efficient, and fast are limited. The classic Triple Modular Redundancy (TMR) method is the most well-known and popular technique in space and nuclear environments. However, it incurs more than 200% area and power overheads, which is too costly for ground-level applications. Meanwhile, coding techniques are extensively used to inhibit upset errors in storage cells, but the irregularity of combinatorial logic limits their use in SET mitigation. Therefore, SET mitigation techniques suitable for ground-level applications need to be addressed.
Aware of the demands for SET mitigation techniques in ground-level applications, this thesis proposes two novel approaches based on the redundant wire and approximate logic techniques.
The Redundant Wire is a SET mitigation technique. By selectively adding redundant wire connections, the technique can prevent targeted transient faults from propagating on the fly. This thesis proposes a set of signature-based evaluation equations to efficiently estimate the protecting effect provided by each redundant wire candidate. Based on the estimated results, a greedy algorithm is used to insert the best candidate repeatedly. Simulation results substantiate that the evaluation equations can achieve up to 98% accuracy on average. Regarding protecting effects, the technique can mask 18.4% of the faults with a 4.3% area, 4.4% power, and 5.4% delay overhead on average. Overall, the quality of the protecting results obtained is 2.8 times better than that of the previous work. Additionally, the impact of synthesis constraints and signature length is discussed.
Approximate Logic is a partial TMR technique offering a trade-off between fault coverage and area overheads. The approximate logic consists of an under-approximate logic and an over-approximate logic. The under-approximate logic is a subset of the original min-terms and the over-approximate logic is a subset of the original max-terms. This thesis proposes a new algorithm for generating the two approximate logics. Throughout the generation process, the algorithm considers the intrinsic failure probabilities of each gate and utilizes a confidence interval estimate equation to minimize the required computations. The technique is applied to two fault models, Stuck-at and SET, and the separate results are compared and discussed. The results show that the technique can reduce the error by 75% with an area penalty of 46% on some circuits. The delay overhead of this technique is always two additional layers of logic.
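The masking idea behind approximate logic can be illustrated with a small sketch. It is not the thesis's generation algorithm: the example function G, its approximations F and H, and the use of a plain majority voter are all assumptions chosen for illustration. The sketch counts the input patterns for which a flipped output of the original logic is still voted back to the correct value.

```python
from itertools import product


def g(a, b, c):
    """Original logic G = (a AND b) OR c (hypothetical example circuit)."""
    return (a & b) | c


def f_under(a, b, c):
    """Under-approximation F: its on-set is a subset of G's min-terms."""
    return a & b


def h_over(a, b, c):
    """Over-approximation H: its off-set is a subset of G's max-terms."""
    return a | c


def majority(x, y, z):
    """Two-level majority voter (assumed here as the combining element)."""
    return (x & y) | (x & z) | (y & z)


def masked_inputs():
    """Return the inputs for which a flipped output of G is corrected."""
    masked = []
    for a, b, c in product((0, 1), repeat=3):
        good = g(a, b, c)
        # Sanity check of the approximation property: F implies G implies H.
        assert f_under(a, b, c) <= good <= h_over(a, b, c)
        faulty = good ^ 1  # model a transient flip at the output of G
        if majority(f_under(a, b, c), faulty, h_over(a, b, c)) == good:
            masked.append((a, b, c))
    return masked


if __name__ == "__main__":
    print(len(masked_inputs()), "of 8 input patterns mask a flip of G")
```

The fault is masked only where F is 1 or H is 0, so in this sketch the coverage depends on how closely F and H follow G, which is the coverage-versus-area trade-off the abstract describes.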
The two proposed SET mitigation techniques are both highly flexible and applicable to generic combinatorial logic. The simulations show promising SET mitigation ability. The proposed mitigation techniques give designers more choices in developing reliable combinatorial logic for ground-level applications.
On Fault Tolerance Methods for Networks-on-Chip
Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit.
This thesis studies and presents fault tolerance methods for network-on-chip (NoC), which is a design paradigm targeted for very large systems-on-chip. In a NoC, resources such as processors and memories are connected to a communication network, comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels.
The thesis studies the origin of faults in modern technologies and explains their classification into transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model.
The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method at this abstraction level, especially against transient faults. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. Spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors, and their combination with error control coding is illustrated.
At the network layer, positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels.
Techniques for the realization of ultra-reliable spaceborne computer: Final report
Bibliography and new techniques for use of error correction and redundancy to improve reliability of spaceborne computer
Error detection for data communication systems
A description of the problems encountered in the data communications field and the various solutions can be found in a number of diverse, and often theoretical, sources. My intention in writing this thesis is to bring together, in a practical and understandable manner, the theory and the application of a method of error detection used extensively in data communication systems known as the Cyclic Redundancy Check (CRC).
To provide some background on the subject, a description of a data communication system is presented, and the possible sources of error are explored in some detail. Data transmission formats are described, and a comparison of various error detection schemes is presented so that the advantages of the CRC can be more readily understood.
The theory behind the CRC and its physical implementation is given, along with a detailed example showing the effectiveness of the CRC for error detection. Finally, the current state-of-the-art technology available for implementing the various error detection schemes is discussed, with particular emphasis on those technologies that perform the Cyclic Redundancy Check.
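As an illustration of how a CRC detects transmission errors, the following minimal sketch computes a 16-bit CRC bit by bit. The choice of the CCITT polynomial 0x1021 with initial value 0xFFFF, and the helper names `append_crc` and `frame_is_valid`, are assumptions made for this example and are not taken from the thesis.

```python
def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16 (MSB first, no reflection, no final XOR)."""
    crc = init
    for byte in data:
        crc ^= byte << 8  # bring the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:
                # top bit set: shift and subtract (XOR) the generator polynomial
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def append_crc(payload: bytes) -> bytes:
    """Sender side: append the CRC, most significant byte first."""
    return payload + crc16_ccitt(payload).to_bytes(2, "big")


def frame_is_valid(frame: bytes) -> bool:
    """Receiver side: the CRC over payload plus check bytes must be zero."""
    return crc16_ccitt(frame) == 0


if __name__ == "__main__":
    frame = append_crc(b"123456789")
    corrupted = bytes([frame[0] ^ 0x04]) + frame[1:]  # flip one bit
    print(frame_is_valid(frame), frame_is_valid(corrupted))
```

With these parameters the receiver does not need to separate payload and check bytes: running the same division over the whole frame leaves a zero remainder only when no detectable error occurred.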
Error control coding for semiconductor memories
All modern computers have memories built from VLSI RAM chips.
Individually, these devices are highly reliable and any single chip
may perform for decades before failing. However, when many of the
chips are combined in a single memory, the time until at least one
of them fails could decrease to a mere few hours. The presence of
the failed chips causes errors when binary data are stored in and
read out from the memory. As a consequence, the reliability of the
computer memory degrades. These errors are classified into hard
errors and soft errors, which can also be termed permanent and
temporary errors respectively.
In some situations errors may show up as random errors, in
which both 1-to-0 errors and 0-to-1 errors occur randomly in a
memory word. In other situations the most likely errors are
unidirectional errors, in which 1-to-0 errors or 0-to-1 errors may
occur but not both of them in one particular memory word.
To achieve a high-speed and highly reliable computer, we need
large-capacity memory. Unfortunately, with a high density of
semiconductor cells in memory, the error rate increases
dramatically. In particular, VLSI RAMs suffer from soft errors
caused by alpha-particle radiation. Thus the reliability of a
computer could become unacceptable without error reducing schemes.
In practice, several schemes to reduce the effects of memory
errors have been commonly used, but most of them are valid only for
hard errors. As an efficient and economical method, error control
coding can be used to overcome both hard and soft errors.
Therefore it is becoming a widely used scheme in the computer
industry today.
In this thesis, we discuss error control coding for
semiconductor memories. The thesis consists of six chapters.
Chapter one is an introduction to error detecting and correcting
coding for computer memories. Firstly, semiconductor memories and
their problems are discussed. Then some schemes for error reduction
in computer memories are given and the advantages of using error
control coding over other schemes are presented.
In chapter two, after a brief review of memory organizations,
memory cells, their physical construction, and the principle of
storing data are described. Then we analyze the mechanisms of the
various errors occurring in semiconductor memories so that different
coding schemes can be selected for different errors.
Chapter three is devoted to fundamental coding theory. In
this chapter, background on encoding and decoding algorithms is
presented.
In chapter four, random error control codes are discussed.
Among them, error detecting codes, single error correcting/double
error detecting codes and multiple error correcting codes are
analyzed. By using examples, the decoding implementations for
parity codes, Hamming codes, modified Hamming codes and majority
logic codes are demonstrated. Also in this chapter it is shown
that by combining error control coding and other schemes, the
reliability of the memory can be improved by many orders of magnitude.
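To make the single error correcting/double error detecting behaviour concrete, here is a minimal sketch of an extended Hamming (8,4) code. It is an illustrative stand-in, not one of the thesis's examples: the bit layout (overall parity at index 0, Hamming parity bits at positions 1, 2 and 4) and the function names are assumptions of this sketch.

```python
def secded_encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    # Data bits occupy the positions that are not powers of two.
    for pos, i in zip((3, 5, 6, 7), range(4)):
        code[pos] = (nibble >> i) & 1
    # Each Hamming parity bit covers the positions whose index has that bit set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    # Overall parity bit makes the total number of ones even.
    code[0] = sum(code[1:]) % 2
    return code


def secded_decode(code: list):
    """Return (nibble, status); nibble is None for an uncorrectable word."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(code) % 2 == 1
    word = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Single error: the syndrome names the flipped position
        # (syndrome 0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits are wrong.
        return None, "double error detected"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status


def flip(code: list, *positions: int) -> list:
    """Helper for experiments: return a copy with the given bits inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    cw = secded_encode(0b1011)
    print(secded_decode(flip(cw, 5)))     # one flipped bit
    print(secded_decode(flip(cw, 2, 6)))  # two flipped bits
```

The extra overall parity bit is what separates the two cases: a single error makes the overall parity odd, while a double error leaves it even but still produces a non-zero syndrome.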
For unidirectional errors, we introduce unordered codes in
chapter five. Two types of unordered codes are discussed:
systematic and nonsystematic unordered codes. Both of them are
very powerful for unidirectional error detection. As an example of
an optimal nonsystematic unordered code, an efficient balanced code
is analyzed. Then, as an example of systematic unordered codes,
Berger codes are analyzed. Considering the fact that in practice
random errors may still occur in unidirectional error memories,
some recently developed t-random error correcting/all
unidirectional error detecting codes are introduced. Illustrative
examples are also included to facilitate the explanation.
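The Berger code mentioned above can be sketched in a few lines. This is a hedged illustration rather than the thesis's own example: the check symbol is taken to be the binary count of zeros in the information bits, and the function names and the exhaustive checker are assumptions of this sketch.

```python
def berger_encode(info: list) -> list:
    """Append the zero count of the information bits as a binary check symbol."""
    k = len(info)
    r = k.bit_length()  # equals ceil(log2(k + 1)) for k >= 1
    zeros = info.count(0)
    check = [(zeros >> i) & 1 for i in reversed(range(r))]
    return info + check


def berger_check(word: list, k: int) -> bool:
    """Return True when the check symbol matches the zero count of the info part."""
    info, check = word[:k], word[k:]
    value = 0
    for bit in check:
        value = (value << 1) | bit
    return info.count(0) == value


def detects_all_unidirectional(info: list) -> bool:
    """Exhaustively apply every 1-to-0-only and 0-to-1-only error pattern."""
    k = len(info)
    word = berger_encode(info)
    for source in (1, 0):  # source value of the bits that are allowed to flip
        candidates = [i for i, b in enumerate(word) if b == source]
        for mask in range(1, 1 << len(candidates)):
            bad = list(word)
            for j, pos in enumerate(candidates):
                if (mask >> j) & 1:
                    bad[pos] ^= 1
            if berger_check(bad, k):
                return False  # an error pattern slipped through
    return True


if __name__ == "__main__":
    print(berger_encode([1, 0, 1, 1, 0]))
    print(detects_all_unidirectional([1, 0, 1, 1, 0]))
```

Detection works because 1-to-0 errors can only raise the zero count of the information bits while lowering the stored check value, and 0-to-1 errors do the opposite, so the two can never agree again after a unidirectional error.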
Chapter six presents the conclusions of the thesis.
The whole thesis is oriented to the applications of error
control coding for semiconductor memories. Most of the codes
discussed in the thesis are widely used in practice. Through the
thesis we attempt to provide a review of coding in computer
memories and emphasize the advantages of coding. It is clear that
with the requirement for higher-speed and higher-capacity
semiconductor memories, error control coding will play an even more
important role in the future.
Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack
Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep submicron era, resilient adaptable processing systems are sought to maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault-handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Reconfigurable Slack (FaDReS). Here an autonomous fault isolation scheme is employed which neither requires test vectors nor suspends the computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme, which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, the use of adaptive techniques is seen to provide several promising avenues for improving resilience. The scheme developed is demonstrated by hardware design of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, Motion Estimation (ME) engine, Finite Impulse Response (FIR) Filter, Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks in addition to MCNC benchmark circuits. A significant reduction in power consumption is achieved, ranging from 83% for low motion-activity scenes to 12.5% for high motion-activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32dB. The diagnosability, reconfiguration latency, and resource overhead of each approach are analyzed. Compared to previous alternatives, PURE maintains a PSNR within a difference of 4.02dB to 6.67dB from the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault-recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria.
- …