4,193 research outputs found

    Fault-tolerant computer study

    A set of building-block circuits is described which can be used with commercially available microprocessors and memories to implement fault-tolerant distributed computer systems. Each building-block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self-checking computer module with self-contained input/output and interfaces to redundant communication buses. Fault tolerance is achieved by connecting self-checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology that led to the definition of the building-block circuits are discussed.
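
    As a rough illustration of the redundancy scheme this abstract describes, the sketch below (hypothetical Python, with invented names such as SCCM and RedundantNetwork) models self-checking modules attached to redundant buses: a backup bus masks a bus failure, and a spare module takes over after a module self-reports a fault.

    # Minimal sketch, not the paper's design: self-checking computer modules
    # (SCCMs) on redundant buses, with spares covering self-detected faults.
    class Bus:
        def __init__(self, name):
            self.name = name
            self.failed = False

    class SCCM:
        """Self-checking computer module: it reports its own faults."""
        def __init__(self, name, buses):
            self.name = name
            self.buses = buses          # connections to the redundant buses
            self.failed = False

        def send(self, message):
            if self.failed:
                raise RuntimeError(f"{self.name} has self-detected a fault")
            for bus in self.buses:      # fall back to a backup bus if one fails
                if not bus.failed:
                    return f"{self.name} sent {message!r} on {bus.name}"
            raise RuntimeError("no working bus available")

    class RedundantNetwork:
        def __init__(self, active, spares):
            self.active, self.spares = list(active), list(spares)

        def run(self, message):
            for module in list(self.active):
                try:
                    print(module.send(message))
                except RuntimeError:
                    # circumvent the failure: replace the module with a spare
                    self.active.remove(module)
                    if self.spares:
                        replacement = self.spares.pop(0)
                        self.active.append(replacement)
                        print(replacement.send(message))

    buses = [Bus("bus-A"), Bus("bus-B")]
    net = RedundantNetwork(
        active=[SCCM("sccm-0", buses), SCCM("sccm-1", buses)],
        spares=[SCCM("sccm-2", buses)],
    )
    buses[0].failed = True          # a bus failure is masked by the backup bus
    net.active[1].failed = True     # a module failure is masked by the spare
    net.run("sensor frame")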

    Redundancy management for efficient fault recovery in NASA's distributed computing system

    The management of redundancy in computer systems was studied, and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management through efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding computational graphs of tasks in the system architecture and for reconfiguring these tasks after a failure has occurred. Computational structures represented by a path and a complete binary tree were considered, and mesh and hypercube architectures were targeted for their embeddings. The concept of a Hybrid Algorithm Technique was introduced; this technique provides a mechanism for obtaining fault tolerance while also improving performance.
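
    One classical way to embed a path of tasks in a hypercube with unit dilation is to follow a binary-reflected Gray code over the node labels. The sketch below uses that standard construction as a stand-in for the embedding and reconfiguration ideas the abstract mentions; it is not the paper's algorithm or its Hybrid Algorithm Technique, and the failure handling is deliberately simplified.

    # Minimal sketch: Gray-code embedding of a path of tasks into a hypercube.
    def gray_code(n_dims):
        """Node labels of a 2**n_dims hypercube ordered so that
        consecutive labels differ in exactly one bit (i.e. are neighbours)."""
        return [i ^ (i >> 1) for i in range(2 ** n_dims)]

    def embed_path(num_tasks, n_dims, failed=frozenset()):
        """Map tasks 0..num_tasks-1 onto healthy hypercube nodes along a
        Gray-code ordering. Skipping a failed node may stretch one path edge;
        a real reconfiguration algorithm would minimise that dilation."""
        healthy = [node for node in gray_code(n_dims) if node not in failed]
        if num_tasks > len(healthy):
            raise ValueError("not enough healthy nodes")
        return {task: node for task, node in zip(range(num_tasks), healthy)}

    mapping = embed_path(num_tasks=6, n_dims=3)
    print(mapping)                                    # consecutive tasks on adjacent nodes
    remapped = embed_path(6, 3, failed={mapping[2]})  # reconfigure after a node failure
    print(remapped)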

    Adaptive weight estimator for quantum error correction

    Quantum error correction of a surface code or repetition code requires the pairwise matching of error events in a space-time graph of qubit measurements, such that the total weight of the matching is minimized. The input weights follow from a physical model of the error processes that affect the qubits. This approach becomes problematic if the system has sources of error that change over time. Here we show how the weights can be determined from the measured data in the absence of an error model. The resulting adaptive decoder performs well in a time-dependent environment, provided that the characteristic time scale $\tau_{\mathrm{env}}$ of the variations is greater than $\delta t/\bar{p}$, with $\delta t$ the duration of one error-correction cycle and $\bar{p}$ the typical error probability per qubit in one cycle.
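
    The sketch below illustrates the general idea in simplified form rather than the paper's actual estimator: edge error probabilities are estimated from coincident defect counts in a sliding window of measured syndrome data, then converted to the usual matching weights $w = -\log[p/(1-p)]$. The function and parameter names are assumptions for illustration only.

    # Minimal sketch of adaptive weight estimation from measured defect data
    # (not the paper's estimator): count coincident defects per matching-graph
    # edge over a sliding window and turn the rate into a matching weight.
    import math
    from collections import defaultdict

    def estimate_weights(defect_records, edges, window=1000, floor=1e-4):
        """defect_records: list of dicts {detector_id: 0/1}, one per cycle.
        edges: iterable of (detector_a, detector_b) pairs in the matching graph."""
        recent = defect_records[-window:]                  # adapt to the current environment
        counts = defaultdict(int)
        for record in recent:
            for a, b in edges:
                if record.get(a, 0) and record.get(b, 0):  # coincident defects on this edge
                    counts[(a, b)] += 1
        weights = {}
        for edge in edges:
            p_hat = max(counts[edge] / max(len(recent), 1), floor)
            p_hat = min(p_hat, 0.5 - floor)                # keep the log-odds finite
            weights[edge] = -math.log(p_hat / (1.0 - p_hat))
        return weights

    # Toy usage: two detectors whose defects coincide about 10% of the time.
    import random
    random.seed(0)
    records = [{0: int(random.random() < 0.1)} for _ in range(2000)]
    for r in records:
        r[1] = r[0]
    print(estimate_weights(records, edges=[(0, 1)]))       # weight tracks the measured rate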

    A Real-Time Error Detection (RTD) architecture and its use for reliability and post-silicon validation for F/F based memory arrays

    This work proposes in-situ Real-Time Error Detection (RTD): embedding hardware in a memory array to detect a fault in the array when it occurs, rather than when it is read. RTD breaks the serialization between data access and error detection and can therefore speed up the access time of arrays that use in-line error correction. The approach can also reduce the time needed to root-cause array-related bugs during post-silicon validation and product testing. The paper introduces a two-dimensional error-correction scheme based on RTD and also presents a proactive error-correction method that combines RTD with demand scrubbing. The work describes how to build RTD into a memory array with flip-flops that track the column parity in real time. A comparison of the proposed two-dimensional ECC scheme with single-error-correction double-error-detection shows that the RTD design has comparable error detection and correction strength and, depending on the array dimensions and configuration, reduces access time by 4% to 26% at an area overhead of -7% to 33% and a power overhead of -42% to 86%, respectively (negative values indicate a reduction).
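
    The following sketch is a hypothetical software model, not the paper's circuit: it shows the two-dimensional principle of a per-row parity checked on read combined with a per-column parity tracked on every write, so that a scrub pass can locate and correct a single flipped cell at the intersection of the failing row and column checks.

    # Minimal sketch of two-dimensional parity with a real-time column tracker.
    class TwoDParityArray:
        def __init__(self, rows, cols):
            self.rows, self.cols = rows, cols
            self.data = [[0] * cols for _ in range(rows)]
            self.row_parity = [0] * rows
            self.col_parity = [0] * cols          # RTD-style running column parity

        def write(self, r, c, bit):
            old = self.data[r][c]
            self.data[r][c] = bit
            self.row_parity[r] ^= old ^ bit
            self.col_parity[c] ^= old ^ bit       # column parity updated as data is written

        def scrub(self):
            """Demand-scrub style pass: locate and fix a single upset."""
            bad_rows = [r for r in range(self.rows)
                        if self.row_parity[r] != sum(self.data[r]) % 2]
            bad_cols = [c for c in range(self.cols)
                        if self.col_parity[c] != sum(row[c] for row in self.data) % 2]
            if len(bad_rows) == 1 and len(bad_cols) == 1:
                r, c = bad_rows[0], bad_cols[0]
                self.data[r][c] ^= 1              # correct the flipped cell in place
                return (r, c)
            return None

    array = TwoDParityArray(4, 8)
    array.write(2, 5, 1)
    array.data[2][5] ^= 1                         # simulate a soft error in the cell array
    print(array.scrub())                          # -> (2, 5), now corrected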

    Online Bit Flip Detection for In-Memory B-Trees on Unreliable Hardware

    Hardware vendors constantly decrease the feature sizes of integrated circuits to obtain better performance and energy efficiency. Due to cosmic rays, low voltage, or heat dissipation, hardware -- both processors and memory -- becomes more and more unreliable as the error rate increases. From a database perspective, bit flip errors in main memory will become a major challenge for modern in-memory database systems, which keep all their enterprise data in volatile, unreliable main memory. Although existing hardware error control techniques like ECC-DRAM are able to detect and correct memory errors, their detection and correction capabilities are limited. Moreover, hardware error correction faces major drawbacks in terms of acquisition costs, additional memory utilization, and latency. In this paper, we argue that slightly increasing data redundancy at the right places by incorporating context knowledge already increases error detection significantly. We use the B-Tree -- a widespread index structure -- as an example and propose various techniques for online error detection that increase its overall reliability. In our experiments, we found that our techniques can detect more errors in less time on commodity hardware compared to non-resilient B-Trees running in an ECC-DRAM environment. Our techniques can easily be adapted for other data structures and are a first step towards resilient database systems that can cope with unreliable hardware.
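
    As an illustration of using context knowledge for online detection (the concrete checks below are assumptions for the sketch, not necessarily the paper's technique set), a B-Tree node can validate a small software checksum, the sorted order of its keys, and the key range implied by its parent on every traversal.

    # Minimal sketch: context-based bit flip detection for a B-Tree node.
    import struct
    import zlib

    class Node:
        def __init__(self, keys, low=float("-inf"), high=float("inf")):
            self.keys = keys
            self.low, self.high = low, high       # bounds inherited from the parent
            self.checksum = self._compute_checksum()

        def _compute_checksum(self):
            packed = b"".join(struct.pack("<q", k) for k in self.keys)
            return zlib.crc32(packed)

        def check(self):
            """Online detection: run on every lookup that touches this node."""
            errors = []
            if self._compute_checksum() != self.checksum:
                errors.append("checksum mismatch (bit flip in keys or checksum)")
            if any(a > b for a, b in zip(self.keys, self.keys[1:])):
                errors.append("keys out of order")
            if any(not (self.low <= k <= self.high) for k in self.keys):
                errors.append("key outside parent-implied range")
            return errors

    node = Node([10, 20, 30, 40], low=5, high=50)
    node.keys[2] ^= 1 << 40                       # simulate a single bit flip
    print(node.check())                           # the context checks flag the corruption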

    Computing in the RAIN: a reliable array of independent nodes

    The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.
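
    As a sketch of the data-storage idea only (RAIN's published error-control codes are more general than this), the example below uses a single XOR parity block across nodes, which is enough to rebuild the contents of any one failed node from the survivors.

    # Minimal sketch: single-parity erasure coding across storage nodes.
    def xor_blocks(blocks):
        """XOR a list of equal-length byte strings together."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    def encode(data_blocks):
        """Store each data block on its own node, plus one parity node."""
        return list(data_blocks) + [xor_blocks(data_blocks)]

    def recover(stored, lost_index):
        """Rebuild the block held by a single failed node from the survivors."""
        survivors = [b for i, b in enumerate(stored) if i != lost_index]
        return xor_blocks(survivors)

    nodes = encode([b"chunk-A1", b"chunk-B2", b"chunk-C3"])
    print(recover(nodes, lost_index=1))           # -> b'chunk-B2'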