    The Raincore Distributed Session Service for Networking Elements

    Motivated by the explosive growth of the Internet, we study efficient and fault-tolerant distributed session-layer protocols for networking elements. These protocols are designed to enable a network cluster to share the state information necessary for balancing network traffic and computation load among a group of networking elements. In addition, in the presence of failures, they allow network traffic to fail over from failed networking elements to healthy ones. To maximize the overall network throughput of the cluster, we assume a unicast communication medium for these protocols. The Raincore Distributed Session Service is based on a fault-tolerant token protocol and provides group membership, reliable multicast, and mutual exclusion services in a networking environment. We show that this service provides atomic reliable multicast with consistent ordering. We also show that the Raincore token protocol incurs less CPU task-switching overhead than a broadcast-based protocol in this environment. The Raincore technology was transferred to Rainfinity, a startup company focusing on software for Internet reliability and performance. Rainwall, Rainfinity’s first product, was developed using the Raincore Distributed Session Service. We present initial performance results for the Rainwall product that validate our design assumptions and goals.
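
    As a minimal sketch of the general idea behind token-based ordering (not the Raincore protocol itself), the following Python toy circulates a token around the membership ring; only the token holder multicasts, and the sequence counter carried by the token yields the same total order at every node. The class names and the in-process "network" are illustrative assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class Token:
        seq: int = 0                 # global sequence number carried by the token
        members: tuple = ()          # current group membership view

    @dataclass
    class Node:
        node_id: str
        pending: list = field(default_factory=list)    # messages waiting to be sent
        delivered: list = field(default_factory=list)  # (seq, sender, payload) tuples

        def on_token(self, token):
            # Only the token holder multicasts, so every message gets a unique,
            # increasing sequence number -- hence a consistent total order.
            stamped = []
            while self.pending:
                token.seq += 1
                stamped.append((token.seq, self.node_id, self.pending.pop(0)))
            return stamped, token

    def run_round(nodes, token):
        # One full circulation of the token around the ring.
        for node in nodes:
            stamped, token = node.on_token(token)
            for n in nodes:                            # stand-in for reliable multicast
                n.delivered.extend(stamped)
        return token

    nodes = [Node("A"), Node("B"), Node("C")]
    nodes[0].pending.append("update-1")
    nodes[2].pending.append("update-2")
    run_round(nodes, Token(members=("A", "B", "C")))
    assert all(n.delivered == nodes[0].delivered for n in nodes)   # same order everywhere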

    Computing in the RAIN: a reliable array of independent nodes

    The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.
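
    As a toy illustration of error-control coding for distributed storage (far simpler than the computationally efficient codes the paper develops), the sketch below keeps a single XOR parity block so that any one lost block can be rebuilt from the survivors. Block contents and placement are assumptions.

    def encode(blocks):
        # Return the data blocks plus one parity block (bytewise XOR of them all).
        parity = blocks[0]
        for blk in blocks[1:]:
            parity = bytes(a ^ b for a, b in zip(parity, blk))
        return list(blocks) + [parity]

    def recover(stored, missing_index):
        # Rebuild the single missing block by XOR-ing all surviving blocks.
        survivors = [blk for i, blk in enumerate(stored) if i != missing_index]
        rebuilt = survivors[0]
        for blk in survivors[1:]:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, blk))
        return rebuilt

    data = [b"node0blk", b"node1blk", b"node2blk"]   # equal-sized blocks on three nodes
    stored = encode(data)                            # parity block lands on a fourth node
    assert recover(stored, 1) == data[1]             # any single lost block is recoverable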

    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any one of this vast number of components can fail at any time, potentially producing erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these fault-tolerance techniques.
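
    One representative technique from the fault-recovery families such surveys cover is periodic checkpointing with rollback. The sketch below is a generic illustration rather than anything taken from the survey; the task, checkpoint format, and failure injection are assumptions.

    import pickle, random

    def checkpoint(state, path="ckpt.bin"):
        with open(path, "wb") as f:
            pickle.dump(state, f)

    def restore(path="ckpt.bin"):
        with open(path, "rb") as f:
            return pickle.load(f)

    def long_computation(total_steps=100, ckpt_every=10, fail_prob=0.05):
        state = {"step": 0, "acc": 0}
        checkpoint(state)                            # initial checkpoint
        while state["step"] < total_steps:
            if random.random() < fail_prob:          # simulated transient failure
                state = restore()                    # roll back to the last checkpoint
                continue
            state["acc"] += state["step"]
            state["step"] += 1
            if state["step"] % ckpt_every == 0:
                checkpoint(state)
        return state["acc"]

    print(long_computation())                        # completes despite injected failures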

    A Configurable Transport Layer for CAF

    The message-driven nature of actors lays a foundation for developing scalable and distributed software. While the actor itself has been thoroughly modeled, the message-passing layer lacks a common definition. Properties and guarantees of message exchange often shift with implementations and contexts. This adds complexity to the development process, limits portability, and removes transparency from distributed actor systems. In this work, we examine actor communication, focusing on the implementation and runtime costs of reliable and ordered delivery. Both guarantees are often based on TCP for remote messaging, which mixes network transport with the semantics of messaging. However, the choice of transport may follow different constraints and is often governed by deployment. As a first step towards re-architecting actor-to-actor communication, we decouple the messaging guarantees from the transport protocol. We validate our approach by redesigning the network stack of the C++ Actor Framework (CAF) so that an arbitrary transport protocol can be combined with additional functions for remote messaging. An evaluation quantifies the cost of composability and the impact of individual layers on the entire stack.
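
    The sketch below illustrates the layering idea with hypothetical class names rather than the CAF API: an ordering layer adds sequence numbers and buffers out-of-order packets on top of an interchangeable, unreliable transport.

    import random

    class DatagramTransport:
        # Unreliable, unordered toy transport: packets may be reordered (and,
        # with loss > 0, dropped). Stands in for UDP-like transports.
        def __init__(self, loss=0.0):
            self.loss, self.queue = loss, []
        def send(self, packet):
            if random.random() >= self.loss:
                self.queue.append(packet)
        def poll(self):
            random.shuffle(self.queue)               # simulate reordering in flight
            out, self.queue = self.queue, []
            return out

    class OrderedChannel:
        # Ordering layer composed on top of any object with send()/poll(); it adds
        # sequence numbers on send and releases the in-sequence prefix on receive.
        def __init__(self, transport):
            self.t, self.next_seq, self.expected, self.buffer = transport, 0, 0, {}
        def send(self, payload):
            self.t.send((self.next_seq, payload))
            self.next_seq += 1
        def receive(self):
            for seq, payload in self.t.poll():
                self.buffer[seq] = payload
            delivered = []
            while self.expected in self.buffer:
                delivered.append(self.buffer.pop(self.expected))
                self.expected += 1
            return delivered

    chan = OrderedChannel(DatagramTransport())       # swap the transport, keep the guarantees
    for msg in ["a", "b", "c"]:
        chan.send(msg)
    assert chan.receive() == ["a", "b", "c"]         # delivered in order despite reordering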

    Algorithm-dependent fault tolerance for distributed computing

    Building high-performance web-caching servers

    Totally Ordered Broadcast and Multicast Algorithms: A Comprehensive Survey

    Total order multicast algorithms constitute an important class of problems in distributed systems, especially in the context of fault tolerance. In short, the problem of total order multicast consists of sending messages to a set of processes in such a way that all messages are delivered by all correct destinations in the same order. However, the huge amount of literature on the subject and the plethora of solutions proposed so far make it difficult for practitioners to select a solution adapted to their specific problem. As a result, naive solutions are often used while better ones are ignored. This paper proposes a classification of total order multicast algorithms based on their ordering mechanism and describes a set of common characteristics (e.g., assumptions, properties) with which to evaluate them. In this classification, more than fifty total order broadcast and multicast algorithms are surveyed. The presentation includes asynchronous algorithms as well as algorithms based on the more restrictive synchronous model. Fault-tolerance issues are also considered, as the paper studies the properties and behavior of the different algorithms with respect to failures.
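
    As a concrete instance of one ordering mechanism such classifications cover, the sketch below implements a fixed-sequencer scheme: a single process assigns sequence numbers and every destination delivers in that order. Process and network details are simplified assumptions (no failures, in-process links).

    class Sequencer:
        # The fixed sequencer assigns a global sequence number to every message.
        def __init__(self):
            self.next_seq = 0
        def order(self, msg):
            seq, self.next_seq = self.next_seq, self.next_seq + 1
            return seq, msg

    class Destination:
        # Each destination buffers out-of-order messages and delivers the prefix.
        def __init__(self):
            self.expected, self.buffer, self.delivered = 0, {}, []
        def on_message(self, seq, msg):
            self.buffer[seq] = msg
            while self.expected in self.buffer:
                self.delivered.append(self.buffer.pop(self.expected))
                self.expected += 1

    sequencer = Sequencer()
    destinations = [Destination(), Destination(), Destination()]

    def to_broadcast(msg):
        seq, msg = sequencer.order(msg)              # senders route through the sequencer
        for d in destinations:                       # then the stamped message is multicast
            d.on_message(seq, msg)

    for m in ["m1", "m2", "m3"]:
        to_broadcast(m)
    assert all(d.delivered == ["m1", "m2", "m3"] for d in destinations)   # one total order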

    The TOTEM Experiment at the CERN Large Hadron Collider

    The TOTEM Experiment will measure the total pp cross-section with the luminosity-independent method and study elastic and diffractive scattering at the LHC. To achieve optimum forward coverage for charged particles emitted in pp collisions at the interaction point IP5, two tracking telescopes, T1 and T2, will be installed on each side in the pseudorapidity region 3.1 < |η| < 6.5, and Roman Pot stations will be placed at distances of 147 m and 220 m from IP5. Being an independent experiment but technically integrated into CMS, TOTEM will first operate in standalone mode to pursue its own physics programme and at a later stage together with CMS for a common physics programme. This article gives a description of the TOTEM apparatus and its performance.