
    Redundancy management for efficient fault recovery in NASA's distributed computing system

    The management of redundancy in computer systems was studied, and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management through efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize resources for embedding computational graphs of tasks in the system architecture and for reconfiguring these tasks after a failure has occurred. Computational structures represented by a path and by the complete binary tree were considered, and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of the Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance.
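
    To make the embedding idea concrete, the following minimal Python sketch maps a path of tasks onto a hypercube using a binary reflected Gray code, so that consecutive tasks land on adjacent processors (dilation 1). It only illustrates the kind of embedding the abstract refers to; it is not the paper's specific algorithm.

# Hedged sketch: embedding a path (linear chain of tasks) into a hypercube
# architecture via a binary reflected Gray code. Consecutive Gray codes differ
# in exactly one bit, so consecutive tasks are mapped to adjacent processors.
def gray_code_embedding(dim):
    """Map task i of a path with 2**dim tasks to hypercube node gray(i)."""
    return [i ^ (i >> 1) for i in range(2 ** dim)]

def are_hypercube_neighbors(a, b):
    """Two hypercube nodes are adjacent iff their labels differ in one bit."""
    return bin(a ^ b).count("1") == 1

nodes = gray_code_embedding(4)
# Every pair of consecutive tasks is mapped to physically adjacent processors.
assert all(are_hypercube_neighbors(u, v) for u, v in zip(nodes, nodes[1:]))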

    On Fault Tolerance Methods for Networks-on-Chip

    Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, the relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for the network-on-chip (NoC), a design paradigm targeted for very large systems-on-chip. In a NoC, resources such as processors and memories are connected to a communication network, comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains their classification into transient, intermittent, and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method at this abstraction level, especially against transient faults. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults, so other solutions against them are presented. The introduction of spare wires and split transmissions is shown to provide good tolerance against intermittent and permanent errors, and their combination with error control coding is illustrated. At the network layer, positioned above the data link layer, fault tolerance can be achieved with the design of fault-tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis, together with realizations in both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels.
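
    As a concrete illustration of error control coding on a link, the Python sketch below implements a single-error-correcting Hamming(7,4) code. It is a generic textbook code chosen for brevity, not one of the specific on-chip codes evaluated in the thesis.

# Hedged sketch: Hamming(7,4) single-error correction for a 4-bit flit.
# Codeword positions are 1..7 with parity bits at positions 1, 2, and 4.
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword."""
    assert len(data_bits) == 4
    code = [0] * 8                      # index 0 unused; positions 1..7
    for pos, bit in zip((3, 5, 6, 7), data_bits):
        code[pos] = bit
    for p in (1, 2, 4):                 # parity p covers positions whose index has bit p set
        code[p] = 0
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    return code[1:]

def hamming74_decode(received):
    """Return (corrected 4 data bits, error position or 0 if none detected)."""
    code = [0] + list(received)
    syndrome = 0
    for p in (1, 2, 4):
        parity = 0
        for i in range(1, 8):
            if i & p:
                parity ^= code[i]
        if parity:
            syndrome |= p
    if syndrome:                        # nonzero syndrome points at the flipped position
        code[syndrome] ^= 1
    return [code[i] for i in (3, 5, 6, 7)], syndrome

# A single transient bit flip on the link is corrected transparently.
word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                            # flip codeword position 5 during transmission
data, err_pos = hamming74_decode(word)
assert data == [1, 0, 1, 1] and err_pos == 5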

    On the Effect of Quantum Interaction Distance on Quantum Addition Circuits

    We investigate the theoretical limits of the effect of the quantum interaction distance on the speed of exact quantum addition circuits. For this study, we exploit graph embedding for quantum circuit analysis. We study a logical mapping of the qubits and gates of any $\Omega(\log n)$-depth quantum adder circuit for two $n$-qubit registers onto a practical architecture, which limits interaction distance to the nearest neighbors only and supports only one- and two-qubit logical gates. Unfortunately, on the chosen $k$-dimensional practical architecture, we prove that the depth lower bound of any exact quantum addition circuit is no longer $\Omega(\log n)$, but $\Omega(\sqrt[k]{n})$. This result, the first application of graph embedding to quantum circuits and devices, provides a new tool for compiler development, emphasizes the impact of quantum computer architecture on performance, and acts as a cautionary note when evaluating the time performance of quantum algorithms. Comment: accepted for ACM Journal on Emerging Technologies in Computing Systems.
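
    The intuition behind the changed bound can be summarized by a simple distance ("light-cone") argument. The short LaTeX note below sketches it under the assumption of unit-depth layers of one- and two-qubit nearest-neighbor gates on a k-dimensional grid; it is an informal reconstruction for orientation, not the paper's proof.

% Hedged sketch: the distance argument behind the Omega(n^{1/k}) depth bound.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
In a $k$-dimensional grid, a ball of radius $r$ contains only $O(r^k)$ sites, so any
placement of the $\Theta(n)$ input qubits of the two addend registers spans a region of
graph diameter $d \ge c\, n^{1/k}$ for some constant $c > 0$. Each layer of
nearest-neighbor one- and two-qubit gates can propagate a dependency by at most one
grid edge, so after depth $T$ an output qubit can depend only on inputs within graph
distance $T$. The most significant sum bit of an exact adder depends on every input
qubit, hence
\[
  T \;\ge\; d \;=\; \Omega\!\bigl(\sqrt[k]{n}\bigr),
\]
which dominates the distance-unconstrained $\Omega(\log n)$ lower bound.
\end{document}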

    Large Scale Modular Quantum Computer Architecture with Atomic Memory and Photonic Interconnects

    The practical construction of scalable quantum computer hardware capable of executing non-trivial quantum algorithms will require the juxtaposition of different types of quantum systems. We analyze a modular ion trap quantum computer architecture with a hierarchy of interactions that can scale to very large numbers of qubits. Local entangling quantum gates between qubit memories within a single register are accomplished using natural interactions between the qubits, and entanglement between separate registers is completed via a probabilistic photonic interface between qubits in different registers, even over large distances. We show that this architecture can be made fault-tolerant and demonstrate its viability for fault-tolerant execution of modest-size quantum circuits.
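
    The register-to-register links are probabilistic, so they are typically driven in a repeat-until-success fashion. The small Python sketch below estimates the mean time to establish one such link; the per-attempt success probability and attempt period are illustrative placeholders, not values from the paper.

# Hedged sketch: repeat-until-success photonic entanglement between two registers.
import random

def attempts_until_link(p, rng):
    """Heralded attempts until one photonic Bell pair between registers succeeds."""
    attempts = 1
    while rng.random() >= p:
        attempts += 1
    return attempts

def mean_link_time(p, t_attempt=1e-6, trials=50_000, seed=0):
    """Average time to establish one inter-register link, in seconds."""
    rng = random.Random(seed)
    mean_attempts = sum(attempts_until_link(p, rng) for _ in range(trials)) / trials
    return mean_attempts * t_attempt    # expected value is t_attempt / p

# With p = 0.01 the mean is ~100 attempts (geometric distribution), i.e. ~1e-4 s here.
print(mean_link_time(p=0.01))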

    Parallel Architectures for Planetary Exploration Requirements (PAPER)

    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration, with particular reference to NASA Langley Research Center's (NASA/LaRC) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concept appears to be a promising candidate, except that more detailed information is required. The feasibility of adding neural computation capability to this architecture needs to be studied. Key impact issues for the architectural design of computing systems meant for planetary missions were also identified.

    Improving Scalability and Usability of Parallel Runtime Environments for High Availability and High Performance Systems

    The number of processors embedded in high performance computing platforms is growing daily to solve larger and more complex problems. Hence, parallel runtime environments have to support and adapt to underlying platforms that require scalability and fault management in increasingly dynamic environments. This dissertation aims to analyze, understand, and improve the state-of-the-art mechanisms for managing highly dynamic, large scale applications. It demonstrates that the use of new scalable and fault-tolerant topologies, combined with rerouting techniques, builds parallel runtime environments that are able to efficiently and reliably deliver sets of information to a large number of processes. Several important graph properties are provided to illustrate the theoretical capability of these topologies in terms of both scalability and fault tolerance: reasonable degree, regularity, low diameter, symmetry, low cost factor, low message traffic density, optimal connectivity, low fault-diameter, and strong resilience. The dissertation builds a communication framework based on these topologies to support parallel runtime environments. Such a framework can handle multiple types of messages, e.g., unicast, multicast, broadcast, and all-gather. Additionally, the communication framework has been formally verified to work in both normal and failure circumstances without creating any of the common problems such as broadcast storm, deadlock, and non-progress cycle.
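
    As a toy illustration of delivering a broadcast despite failed processes, the following Python sketch floods a message over a logical topology while routing around failures. The hypercube neighbor function and the failure set are hypothetical examples chosen for brevity; they are not the dissertation's actual topologies.

# Hedged sketch: fault-aware broadcast by flooding over a logical topology.
from collections import deque

def hypercube_neighbors(rank, dim):
    """Neighbors of `rank` in a `dim`-dimensional hypercube (2**dim processes)."""
    return [rank ^ (1 << d) for d in range(dim)]

def broadcast(root, dim, failed):
    """Return the set of live processes reached from `root`, skipping failed
    processes (a simple form of rerouting around faults)."""
    reached, queue = {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in hypercube_neighbors(u, dim):
            if v not in failed and v not in reached:
                reached.add(v)          # v receives the message from u
                queue.append(v)
    return reached

failed = {3, 5}
live = set(range(2 ** 4)) - failed
# The 4-cube has node connectivity 4, so two failures leave it connected and
# every live process still receives the broadcast.
assert broadcast(root=0, dim=4, failed=failed) == live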

    Experimental analysis of computer system dependability

    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: the design phase, the prototype phase, and the operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by a discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
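
    To illustrate the importance sampling idea mentioned above, the Python sketch below estimates a rare failure probability both with plain Monte Carlo and with a shifted sampling distribution. The Gaussian "stress" model and the failure threshold are hypothetical placeholders, not data from the paper.

# Hedged sketch: importance sampling accelerates estimation of P(Z > 4) for Z ~ N(0,1).
import math
import random

def naive_mc(n, rng):
    """Plain Monte Carlo: most samples never hit the rare failure region."""
    return sum(rng.gauss(0.0, 1.0) > 4.0 for _ in range(n)) / n

def importance_sampling(n, rng):
    """Sample from N(4,1), centred on the failure region, and reweight each hit
    by the likelihood ratio f(y)/g(y) = exp(8 - 4y)."""
    total = 0.0
    for _ in range(n):
        y = rng.gauss(4.0, 1.0)
        if y > 4.0:
            total += math.exp(8.0 - 4.0 * y)
    return total / n

rng = random.Random(1)
n = 100_000
# The true value is about 3.17e-5; at this sample size the naive estimate sees only a
# handful of failures, while the importance-sampling estimate is already stable.
print(naive_mc(n, rng), importance_sampling(n, rng))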

    Network-on-Chip

    Limitations of bus-based interconnections in scalability, latency, bandwidth, and power consumption make it difficult to support the huge number of on-chip resources and result in a communication bottleneck. These challenges can be efficiently addressed by implementing a network-on-chip (NoC). This book gives a detailed analysis of various on-chip communication architectures and covers different areas of NoCs, such as their potential, architecture, technical challenges, optimization, design exploration, and research directions. In addition, it discusses current and future trends that could make an impactful and meaningful contribution to the research and design of on-chip communications and NoC systems.