
10,083 research outputs found

Computing in the RAIN: a reliable array of independent nodes

Author: Bohossian Vasken
Bruck Jehoshua
Fan Chenggong C.
LeMahieu Paul S.
Riedel Marc D.
Xu Lihao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.

CiteSeerX

Caltech Authors

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication date: 01/01/2008

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure has occurred. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
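The checksum idea behind Algorithmic Based Fault Tolerance can be sketched in a few lines. This is a minimal, single-process illustration of the classic Huang and Abraham checksum scheme for matrix multiplication, not the parallel subroutine the abstract describes. The function names (`matmul`, `add_column_checksum`, `add_row_checksum`, `locate_and_correct`) are hypothetical, and integer entries are assumed so that checksums can be compared exactly; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(cf):
    """Verify a full-checksum product and repair a single bad data entry.

    cf is the product of a column-checksum matrix and a row-checksum
    matrix: its last row holds column sums and its last column holds
    row sums of the data block. Returns the (row, col) that was
    corrected, or None if all checksums agree.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the entry from the row checksum and the other entries.
        cf[i][j] = cf[i][m] - (sum(cf[i][:m]) - cf[i][j])
        return (i, j)
    raise ValueError("more than one inconsistent row or column")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
cf = matmul(add_column_checksum(a), add_row_checksum(b))
cf[0][1] += 1                      # simulate a corrupted result entry
print(locate_and_correct(cf))      # position of the repaired entry
print([row[:2] for row in cf[:2]]) # recovered data block of the product
```

The intersection of the one inconsistent row and the one inconsistent column pinpoints the faulty entry, which is why a single error can be corrected as well as detected without recomputing the product.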

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

The University of Manchester - Institutional Repository

Universal blind quantum computation

Author: Broadbent Anne
Fitzsimons Joseph
Kashefi Elham
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

We present a protocol which allows a client to have a server carry out a quantum computation for her such that the client's inputs, outputs and computation remain perfectly private, and where she does not require any quantum computational power or memory. The client only needs to be able to prepare single qubits randomly chosen from a finite set and send them to the server, who has the balance of the required quantum computational resources. Our protocol is interactive: after the initial preparation of quantum states, the client and server use two-way classical communication which enables the client to drive the computation, giving single-qubit measurement instructions to the server, depending on previous measurement outcomes. Our protocol works for inputs and outputs that are either classical or quantum. We give an authentication protocol that allows the client to detect an interfering server; our scheme can also be made fault-tolerant. We also generalize our result to the setting of a purely classical client who communicates classically with two non-communicating entangled servers, in order to perform a blind quantum computation. By incorporating the authentication protocol, we show that any problem in BQP has an entangled two-prover interactive proof with a purely classical verifier. Our protocol is the first universal scheme which detects a cheating server, as well as the first protocol which does not require any quantum computation whatsoever on the client's side. The novelty of our approach is in using the unique features of measurement-based quantum computing, which allow us to clearly distinguish between the quantum and classical aspects of a quantum computation. Comment: 20 pages, 7 figures. This version contains detailed proofs of authentication and fault tolerance. It also contains protocols for quantum inputs and outputs and appendices not available in the published version.

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Evaluation of fault-tolerant parallel-processor architectures over long space missions

Author: Johnson Sally C.

The impact of a five-year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with probability 0.99 and that the probability of system failure during one-half hour of full operation be less than 10^-7. The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP), is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.
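To get a feel for the scale of the two reliability figures quoted above, the following sketch converts the five-year survival requirement into an equivalent hourly failure rate and then asks what that rate implies for a half-hour window. It assumes a constant failure rate, R(t) = exp(-rate * t), which is a simplification introduced here for illustration and is not taken from the report; the helper names and the 365.25-day year are likewise assumptions.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # assumption: average calendar year

def constant_failure_rate(reliability, hours):
    """Hourly rate such that exp(-rate * hours) equals the given reliability."""
    return -math.log(reliability) / hours

def failure_probability(rate, hours):
    """Probability of at least one failure within `hours` at a constant rate."""
    return -math.expm1(-rate * hours)  # 1 - exp(-rate * hours), computed stably

mission_hours = 5 * HOURS_PER_YEAR
rate = constant_failure_rate(0.99, mission_hours)
p_half_hour = failure_probability(rate, 0.5)

print(f"equivalent failure rate: {rate:.3e} per hour")
print(f"P(failure in 0.5 h):     {p_half_hour:.3e}")
```

Under this simplified model the half-hour failure probability comes out on the order of 10^-7, so the two requirements are of comparable stringency. The report's own analysis may rest on different failure models for the dormant and full-operation phases.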

NASA Technical Reports Server

What is a quantum computer, and how do we build one?

Author: Kok Pieter
Perez-Delgado Carlos A.
Publication venue: 'American Physical Society (APS)'
Publication date: 25/05/2010

The DiVincenzo criteria for implementing a quantum computer have been seminal in focussing both experimental and theoretical research in quantum information processing. These criteria were formulated specifically for the circuit model of quantum computing. However, several new models for quantum computing (paradigms) have been proposed that do not seem to fit the criteria well. The question, therefore, is what the general criteria for implementing quantum computers are. To this end, a formal operational definition of a quantum computer is introduced. It is then shown that according to this definition a device is a quantum computer if it obeys the following four criteria: Any quantum computer must (1) have a quantum memory; (2) facilitate a controlled quantum evolution of the quantum memory; (3) include a method for cooling the quantum memory; and (4) provide a readout mechanism for subsets of the quantum memory. The criteria are met when the device is scalable and operates fault-tolerantly. We discuss various existing quantum computing paradigms, and how they fit within this framework. Finally, we lay out a roadmap for selecting an avenue towards building a quantum computer. This is summarized in a decision tree intended to help experimentalists determine the most natural paradigm given a particular physical implementation.

arXiv.org e-Print Archive

CiteSeerX

Kent Academic Repository

Designing application software in wide area network settings

Author: Birman Ken
Makpangou Mesaac
Publication date: 01/01/1990

Progress in methodologies for developing robust local area network software has not been matched by similar results for wide area settings. The design of application software spanning multiple local area environments is examined. For important classes of applications, simple design techniques are presented that yield fault-tolerant wide area programs. An implementation of these techniques as a set of tools for use within the ISIS system is described.

INRIA a CCSD electronic archive server

NASA Technical Reports Server

eCommons@Cornell

The Raincore Distributed Session Service for Networking Elements

Author: Bruck Jehoshua
Fan Chenggong Charles
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/04/2001

Motivated by the explosive growth of the Internet, we study efficient and fault-tolerant distributed session layer protocols for networking elements. These protocols are designed to enable a network cluster to share the state information necessary for balancing network traffic and computation load among a group of networking elements. In addition, in the presence of failures, they allow network traffic to fail over from failed networking elements to healthy ones. To maximize the overall network throughput of the networking cluster, we assume a unicast communication medium for these protocols. The Raincore Distributed Session Service is based on a fault-tolerant token protocol, and provides group membership, reliable multicast and mutual exclusion services in a networking environment. We show that this service provides atomic reliable multicast with consistent ordering. We also show that the Raincore token protocol incurs less CPU task-switching overhead than a broadcast-based protocol in this environment. The Raincore technology was transferred to Rainfinity, a startup company that is focusing on software for Internet reliability and performance. Rainwall, Rainfinity’s first product, was developed using the Raincore Distributed Session Service. We present initial performance results of the Rainwall product that validate our design assumptions and goals.

Caltech Authors

Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

Author: Harper Richard

In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead from the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, and fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.

NASA Technical Reports Server