30 research outputs found

    A Programmable Window Comparator for Analog Online Testing

    Get PDF
    This paper discusses the challenge of designing window comparators for analog online testing applications. A programmable window comparator with adaptive error threshold is presented. Experimental results demonstrate that improved fault detection capability is achieved by using the proposed design. Measurement results of the fabricated comparator circuit are also presented

    On-line Testing Field Programmable Analog Array Circuits

    Get PDF
    This work presents an efficient methodology to on-line test field programmable analog array (FPAA) circuits. It proposes to partition the FPAA circuit under test into sub circuits. Each sub circuit is tested by replicating the sub circuit with programmable resources on FPAAs, and comparing the outputs of the original partitioned sub circuit and its replication. The advantages of this approach includes: low implementation cost, enhanced testability, and flexible testing schedules. This work also presents circuit techniques to address stability problems which are often encountered in the proposed on-line testing approach. In addition, the impact of performing circuit partition on testability is investigated in this work. It shows that testability is generally improved in partitioned circuits. Finally, experimental results are presented to demonstrate the feasibility and effectiveness of the proposed techniques

    A Methodology to Perform Online Self-Testing for Field-Programmable Analog Array Circuits

    Get PDF
    This paper presents a methodology to perform online self-testing for analog circuits implemented on field-programmable analog arrays (FPAAs). It proposes to partition the FPAA circuit under test into subcircuits. Each subcircuit is tested by replicating the subcircuit with programmable resources on the FPAA chip, and comparing the outputs of the subcircuit and its replication. To effectively implement the proposed methodology, this paper proposes a simple circuit partition method and develops techniques to address circuit stability problems that are often encountered in the proposed testing method. Furthermore, error sources in the proposed testing circuit are studied and methods to improve the accuracy of testing results are presented. Finally, experimental results are presented to demonstrate the validity of the proposed methodology

    On Fault Tolerance Methods for Networks-on-Chip

    Get PDF
    Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for network-onchip (NoC) which is a design paradigm targeted for very large systems-onchip. In a NoC resources, such as processors and memories, are connected to a communication network; comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains the classification to transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method especially against transient faults at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors and their combination to error control coding is illustrated. At the network layer positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in the both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levelsSiirretty Doriast

    From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators

    Get PDF
    This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads

    Fault-tolerant computation using algebraic homomorphisms

    Get PDF
    Also issued as Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1992.Includes bibliographical references (p. 193-196).Supported by the Defense Advanced Research Projects Agency, monitored by the U.S. Navy Office of Naval Research. N00014-89-J-1489 Supported by the Charles S. Draper Laboratories. DL-H-418472Paul E. Beckmann

    Fault tolerance in space-based digital signal processing and switching systems: Protecting up-link processing resources, demultiplexer, demodulator, and decoder

    Get PDF
    Fault tolerance features in the first three major subsystems appearing in the next generation of communications satellites are described. These satellites will contain extensive but efficient high-speed processing and switching capabilities to support the low signal strengths associated with very small aperture terminals. The terminals' numerous data channels are combined through frequency division multiplexing (FDM) on the up-links and are protected individually by forward error-correcting (FEC) binary convolutional codes. The front-end processing resources, demultiplexer, demodulators, and FEC decoders extract all data channels which are then switched individually, multiplexed, and remodulated before retransmission to earth terminals through narrow beam spot antennas. Algorithm based fault tolerance (ABFT) techniques, which relate real number parity values with data flows and operations, are used to protect the data processing operations. The additional checking features utilize resources that can be substituted for normal processing elements when resource reconfiguration is required to replace a failed unit
    corecore