39,152 research outputs found

    Computing in the RAIN: a reliable array of independent nodes

    Get PDF
    The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology

    CASPR: Judiciously Using the Cloud for Wide-Area Packet Recovery

    Full text link
    We revisit a classic networking problem -- how to recover from lost packets in the best-effort Internet. We propose CASPR, a system that judiciously leverages the cloud to recover from lost or delayed packets. CASPR supplements and protects best-effort connections by sending a small number of coded packets along the highly reliable but expensive cloud paths. When receivers detect packet loss, they recover packets with the help of the nearby data center, not the sender, thus providing quick and reliable packet recovery for latency-sensitive applications. Using a prototype implementation and its deployment on the public cloud and the PlanetLab testbed, we quantify the benefits of CASPR in providing fast, cost effective packet recovery. Using controlled experiments, we also explore how these benefits translate into improvements up and down the network stack
    corecore