32,897 research outputs found
Fault Secure Encoder and Decoder for NanoMemory Applications
Memory cells have been protected from soft errors for more than a decade; due to the increase in soft error rate in logic circuits, the encoder and decoder circuitry around the memory blocks have become susceptible to soft errors as well and must also be protected. We introduce a new approach to design fault-secure encoder and decoder circuitry for memory designs. The key novel contribution of this paper is identifying and defining a new class of error-correcting codes whose redundancy makes the design of fault-secure detectors (FSD) particularly simple. We further quantify the importance of protecting encoder and decoder circuitry against transient errors, illustrating a scenario where the system failure rate (FIT) is dominated by the failure rate of the encoder and decoder. We prove that Euclidean geometry low-density parity-check (EG-LDPC) codes have the fault-secure detector capability. Using some of the smaller EG-LDPC codes, we can tolerate bit or nanowire defect rates of 10% and fault rates of 10^(-18) upsets/device/cycle, achieving a FIT rate at or below one for the entire memory system and a memory density of 10^(11) bit/cm^2 with nanowire pitch of 10 nm for memory blocks of 10 Mb or larger. Larger EG-LDPC codes can achieve even higher reliability and lower area overhead
Advanced techniques in reliability model representation and solution
The current tendency of flight control system designs is towards increased integration of applications and increased distribution of computational elements. The reliability analysis of such systems is difficult because subsystem interactions are increasingly interdependent. Researchers at NASA Langley Research Center have been working for several years to extend the capability of Markov modeling techniques to address these problems. This effort has been focused in the areas of increased model abstraction and increased computational capability. The reliability model generator (RMG) is a software tool that uses as input a graphical object-oriented block diagram of the system. RMG uses a failure-effects algorithm to produce the reliability model from the graphical description. The ASSURE software tool is a parallel processing program that uses the semi-Markov unreliability range evaluator (SURE) solution technique and the abstract semi-Markov specification interface to the SURE tool (ASSIST) modeling language. A failure modes-effects simulation is used by ASSURE. These tools were used to analyze a significant portion of a complex flight control system. The successful combination of the power of graphical representation, automated model generation, and parallel computation leads to the conclusion that distributed fault-tolerant system architectures can now be analyzed
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As means of overhead reduction, the
library offers a build-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanism. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks
On the reliability of electrical drives for safety-critical applications
The aim of this work is to present some issues related to fault tolerant electric drives,which are able to overcome different types of faults occurring in the sensors, in thepower converter and in the electrical machine, without compromising the overallfunctionality of the system. These features are of utmost importance in safety-criticalapplications. In this paper, the reliability of both commercial and innovative driveconfigurations, which use redundant hardware and suitable control algorithms, will beinvestigated for the most common types of fault: besides standard three phase motordrives, also multiphase topologies, open-end winding solutions, multi-machineconfigurations will be analyzed, applied to various electric motor technologies. Thecomplexity of hardware and control strategies will also be compared in this paper, sincethis has a tremendous impact on the investment costs
Wildcard dimensions, coding theory and fault-tolerant meshes and hypercubes
Hypercubes, meshes and tori are well known interconnection networks for parallel computers. The sets of edges in those graphs can be partitioned to dimensions. It is well known that the hypercube can be extended by adding a wildcard dimension resulting in a folded hypercube that has better fault-tolerant and communication capabilities. First we prove that the folded hypercube is optimal in the sense that only a single wildcard dimension can be added to the hypercube. We then investigate the idea of adding wildcard dimensions to d-dimensional meshes and tori. Using techniques from error correcting codes we construct d-dimensional meshes and tori with wildcard dimensions. Finally, we show how these constructions can be used to tolerate edge and node faults in mesh and torus networks
Introduction to the special section on dependable network computing
Dependable network computing is becoming a key part of our daily economic and social life. Every day, millions of users and businesses are utilizing the Internet infrastructure for real-time electronic commerce transactions, scheduling important events, and building relationships. While network traffic and the number of users are rapidly growing, the mean-time between failures (MTTF) is surprisingly short; according to recent studies, in the majority of Internet backbone paths, the MTTF is 28 days. This leads to a strong requirement for highly dependable networks, servers, and software systems. The challenge is to build interconnected systems, based on available technology, that are inexpensive, accessible, scalable, and dependable. This special section provides insights into a number of these exciting challenges
- …