1,381 research outputs found

    The Raincore API for clusters of networking elements

    Get PDF
    Clustering technology offers a way to increase overall reliability and performance of Internet information flow by strengthening one link in the chain without adding others. We have implemented this technology in a distributed computing architecture for network elements. The architecture, called Raincore, originated in the Reliable Array of Independent Nodes, or RAIN, research collaboration between the California Institute of Technology and the US National Aeronautics and Space Agency's Jet Propulsion Laboratory. The RAIN project focused on developing high-performance, fault-tolerant, portable clustering technology for spaceborne computing . The technology that emerged from this project became the basis for a spinoff company, Rainfinity, which has the exclusive intellectual property rights to the RAIN technology. The authors describe the Raincore conceptual architecture and distributed services, which are designed to make it easy for developers to port their applications to run on top of a cluster of networking elements. We include two applications: a Web server prototype that was part of the original RAIN research project and a commercial firewall cluster product from Rainfinity

    Fault tolerant packet-switched network design and Its sensitivity

    Get PDF
    Reliability and performance for telecommunication networks have traditionally been investigated separately in spite of their close relation. A design method integrating them for a reliable packet switched network, called a proofing method, is presented. Two heuristic design approaches (max-average, max-delay-link) for optimizing network cost in the proofing method are described. To verify their effectiveness and applicability, they are compared numerically for three example network topologies. The sensitivity of these two methods is examined with respect to changes in traffic demand and in link reliability. The design sensitivity to variation of input data is examined by changing the predicted probability of link failure, and by increasing the network traffic over the predicted value. The resulting analysis shows relative insensitivity of solutions generated by the two design methods to input data</p

    Deep Space Network information system architecture study

    Get PDF
    The purpose of this article is to describe an architecture for the Deep Space Network (DSN) information system in the years 2000-2010 and to provide guidelines for its evolution during the 1990s. The study scope is defined to be from the front-end areas at the antennas to the end users (spacecraft teams, principal investigators, archival storage systems, and non-NASA partners). The architectural vision provides guidance for major DSN implementation efforts during the next decade. A strong motivation for the study is an expected dramatic improvement in information-systems technologies, such as the following: computer processing, automation technology (including knowledge-based systems), networking and data transport, software and hardware engineering, and human-interface technology. The proposed Ground Information System has the following major features: unified architecture from the front-end area to the end user; open-systems standards to achieve interoperability; DSN production of level 0 data; delivery of level 0 data from the Deep Space Communications Complex, if desired; dedicated telemetry processors for each receiver; security against unauthorized access and errors; and highly automated monitor and control

    The Raincore Distributed Session Service for Networking Elements

    Get PDF
    Motivated by the explosive growth of the Internet, we study efficient and fault-tolerant distributed session layer protocols for networking elements. These protocols are designed to enable a network cluster to share the state information necessary for balancing network traffic and computation load among a group of networking elements. In addition, in the presence of failures, they allow network traffic to fail-over from failed networking elements to healthy ones. To maximize the overall network throughput of the networking cluster, we assume a unicast communication medium for these protocols. The Raincore Distributed Session Service is based on a fault-tolerant token protocol, and provides group membership, reliable multicast and mutual exclusion services in a networking environment. We show that this service provides atomic reliable multicast with consistent ordering. We also show that Raincore token protocol consumes less overhead than a broadcast-based protocol in this environment in terms of CPU task-switching. The Raincore technology was transferred to Rainfinity, a startup company that is focusing on software for Internet reliability and performance. Rainwall, Rainfinity’s first product, was developed using the Raincore Distributed Session Service. We present initial performance results of the Rainwall product that validates our design assumptions and goals

    A new fault-tolerant configuration for the Cambridge Ring: the Hierarchical Ring-Star

    Get PDF
    The primary objective of this research is to look at ways of resolving the reliability problems of the Cambridge Ring local area network system. The result is a novel design to enhance the Cambridge Ring with fault tolerance by introducing redundant communication paths with dynamic reconfiguration. The proposed Ring-Star system combines the advantages of ring and star networks to create a network which is topologically resilient while retaining the efficient communication advantage of rings. [Continues.

    Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip

    Get PDF
    High-end MPSoC systems with built-in high-radix topologies achieve good performance because of the improved connectivity and the reduced network diameter. In high-end MPSoC systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than routers using simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to discover the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that LBDRhr logic can tolerate all of the permanent failure combinations of long-range links and 80% of links failures at short-range links. Case studies show that the error detection method based on the new IIR extraction method reduces the power consumption and the residual error rate by 33% and up to two orders of magnitude, respectively, compared to triple modular redundancy. The impact of network topologies on the efficiency of the detection mechanism has been examined in this work, as well

    Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

    Get PDF
    Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions

    Efficient fault-tolerant routing in multihop optical WDM networks

    Get PDF
    This paper addresses the problem of efficient routing in unreliable multihop optical networks supported by Wavelength Division Multiplexing (WDM). We first define a new cost model for routing in (optical) WDM networks that is more general than the existing models. Our model takes into consideration not only the cost of wavelength access and conversion but also the delay for queuing signals arriving at different input channels that share the same output channel at the same node. We then propose a set of efficient algorithms in a reliable WDM network on the new cost model for each of the three most important communication patterns - multiple point-to-point routing, multicast, and multiple multicast. Finally, we show how to obtain a set of efficient algorithms in an unreliable WDM network with up to f faulty optical channels and wavelength conversion gates. Our strategy is to first enhance the physical paths constructed by the algorithms for reliable networks to ensure success of fault-tolerant routing, and then to route among the enhanced paths to establish a set of fault-free physical routes to complete the corresponding routing request for each of the communication patterns.published_or_final_versio

    SpiNNaker: Fault tolerance in a power- and area- constrained large-scale neuromimetic architecture

    Get PDF
    AbstractSpiNNaker is a biologically-inspired massively-parallel computer designed to model up to a billion spiking neurons in real-time. A full-fledged implementation of a SpiNNaker system will comprise more than 105 integrated circuits (half of which are SDRAMs and half multi-core systems-on-chip). Given this scale, it is unavoidable that some components fail and, in consequence, fault-tolerance is a foundation of the system design. Although the target application can tolerate a certain, low level of failures, important efforts have been devoted to incorporate different techniques for fault tolerance. This paper is devoted to discussing how hardware and software mechanisms collaborate to make SpiNNaker operate properly even in the very likely scenario of component failures and how it can tolerate system-degradation levels well above those expected
    corecore