35,566 research outputs found
Redundant Logic Insertion and Fault Tolerance Improvement in Combinational Circuits
This paper presents a novel method to identify and insert redundant logic
into a combinational circuit to improve its fault tolerance without having to
replicate the entire circuit as is the case with conventional redundancy
techniques. In this context, it is discussed how to estimate the fault masking
capability of a combinational circuit using the truth-cum-fault enumeration
table, and then it is shown how to identify the logic that can introduced to
add redundancy into the original circuit without affecting its native
functionality and with the aim of improving its fault tolerance though this
would involve some trade-off in the design metrics. However, care should be
taken while introducing redundant logic since redundant logic insertion may
give rise to new internal nodes and faults on those may impact the fault
tolerance of the resulting circuit. The combinational circuit that is
considered and its redundant counterparts are all implemented in semi-custom
design style using a 32/28nm CMOS digital cell library and their respective
design metrics and fault tolerances are compared
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
system. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit-flip) on the fly of a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result while one process failure has happened. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly
Recommended from our members
Fault tolerance in super-scalar and VLIW processors
In this paper, we present a method for utilizing the spare capacity in super-scalar and very long instruction word (VLIW) processors to tolerate functional unit failures. Unlike previous work that was primarily interested in detection of transient faults, we are concerned with more permanent and/or intermittent faults which necessitate processor reconfiguration. Our method utilizes the VLIW compiler or the superscalar scheduler to insert redundant operations whenever idle functional units exist. The results of these redundant operations are used to detect and diagnose functional unit failures. For super-scalar processors, the scheduler can then utilize this information to ensure that operations are performed only on non-faulty units. In VLIW processors, this is equivalent to recompiling the code to run on the remaining non-faulty functional units. Since in certain applications, recompilation may not be possible, we consider two alternative reconfiguration strategies for VLIW processors. These strategies sacrifice storage space and execution time, respectively, in order to reconfigure without recompiling. We present Markov models that describe the behavior of processors using these different approaches and we evaluate their reliabilities. The results show that, while super-scalar and VLIW with recompilation provide the highest reliability, all proposed strategies significantly increase reliability over that of an unprotected processor
Model Prediction-Based Approach to Fault Tolerant Control with Applications
Abstract— Fault-tolerant control (FTC) is an integral component in industrial processes as it enables the system to continue robust operation under some conditions. In this paper, an FTC scheme is proposed for interconnected systems within an integrated design framework to yield a timely monitoring and detection of fault and reconfiguring the controller according to those faults. The unscented Kalman filter (UKF)-based fault detection and diagnosis system is initially run on the main plant and parameter estimation is being done for the local faults. This critical information\ud
is shared through information fusion to the main system where the whole system is being decentralized using the overlapping decomposition technique. Using this parameter estimates of decentralized subsystems, a model predictive control (MPC) adjusts its parameters according to the\ud
fault scenarios thereby striving to maintain the stability of the system. Experimental results on interconnected continuous time stirred tank reactors (CSTR) with recycle and quadruple tank system indicate that the proposed method is capable to correctly identify various faults, and then controlling the system under some conditions
On the reliability of electrical drives for safety-critical applications
The aim of this work is to present some issues related to fault tolerant electric drives,which are able to overcome different types of faults occurring in the sensors, in thepower converter and in the electrical machine, without compromising the overallfunctionality of the system. These features are of utmost importance in safety-criticalapplications. In this paper, the reliability of both commercial and innovative driveconfigurations, which use redundant hardware and suitable control algorithms, will beinvestigated for the most common types of fault: besides standard three phase motordrives, also multiphase topologies, open-end winding solutions, multi-machineconfigurations will be analyzed, applied to various electric motor technologies. Thecomplexity of hardware and control strategies will also be compared in this paper, sincethis has a tremendous impact on the investment costs
On quantifying fault patterns of the mesh interconnect networks
One of the key issues in the design of Multiprocessors System-on-Chip (MP-SoCs), multicomputers, and peerto- peer networks is the development of an efficient communication network to provide high throughput and low latency and its ability to survive beyond the failure of individual components. Generally, the faulty components may be coalesced into fault regions, which are classified into convex and concave shapes. In this paper, we propose a mathematical solution for counting the number of common fault patterns in a 2-D mesh interconnect network including both convex (|-shape, | |-shape, ý-shape) and concave (L-shape, Ushape, T-shape, +-shape, H-shape) regions. The results presented in this paper which have been validated through simulation experiments can play a key role when studying, particularly, the performance analysis of fault-tolerant routing algorithms and measure of a network fault-tolerance expressed as the probability of a disconnection
- …