Hybrid Message Logging. Combining advantages of Sender-based and Receiver-based Approaches

Author: Luque Emilio
Meyer Hugo
Rexachs Dolores
Publication venue: The Authors. Published by Elsevier B.V.
Publication date: 31/12/2014

With the growing scale of High Performance Computing applications comes an increase in the number of interruptions caused by hardware failures. As the tendency is to scale parallel executions to hundreds of thousands of processes, fault tolerance is becoming an important matter. Uncoordinated fault tolerance protocols, such as message logging, seem to be the best option, since coordinated protocols might compromise application scalability. Considering that most of the overhead during failure-free executions is caused by message logging approaches, in this paper we propose a Hybrid Message Logging protocol. It combines the fast recovery of pessimistic receiver-based message logging with the low protection overhead of pessimistic sender-based message logging. Hybrid Message Logging aims to reduce the overhead introduced by pessimistic receiver-based approaches by allowing applications to continue normally before a received message is properly saved. To guarantee that no message is lost, pessimistic sender-based logging is used to temporarily save messages while the receiver fully saves its received messages. Experiments have shown that we can achieve up to 43% overhead reduction compared to a pessimistic receiver-based logging approach.
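The mechanism described in this abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the class names (`Sender`, `Receiver`), the method names (`flush`, `crash`, `recover`), and the use of a dictionary as a stand-in for stable storage are all assumptions made for illustration. The sketch only shows the general idea that the sender keeps a temporary copy of each message until the receiver confirms it has saved it.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    # Hypothetical temporary sender-side log: sequence number -> payload.
    temp_log: dict = field(default_factory=dict)
    next_seq: int = 0

    def send(self, receiver, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.temp_log[seq] = payload   # keep a copy before handing the message over
        receiver.receive(seq, payload)
        return seq

    def confirm(self, seq):
        # The receiver has saved the message, so the temporary copy is dropped.
        self.temp_log.pop(seq, None)


@dataclass
class Receiver:
    delivered: list = field(default_factory=list)   # what the application has seen
    pending: dict = field(default_factory=dict)     # received but not yet saved
    stable_log: dict = field(default_factory=dict)  # stand-in for stable storage

    def receive(self, seq, payload):
        self.delivered.append(payload)  # the application continues immediately
        self.pending[seq] = payload     # saving is deferred to flush()

    def flush(self, sender):
        # Save pending messages, then let the sender release its copies.
        for seq in sorted(self.pending):
            self.stable_log[seq] = self.pending[seq]
            sender.confirm(seq)
        self.pending.clear()

    def crash(self):
        # Volatile state is lost; the stable log survives.
        self.delivered.clear()
        self.pending.clear()

    def recover(self, sender):
        # Replay saved messages plus any still held in the sender's temporary log.
        replay = dict(self.stable_log)
        replay.update(sender.temp_log)
        self.delivered = [replay[seq] for seq in sorted(replay)]
        return self.delivered


sender, receiver = Sender(), Receiver()
sender.send(receiver, "a")
sender.send(receiver, "b")
receiver.flush(sender)        # "a" and "b" are saved; the sender drops its copies
sender.send(receiver, "c")    # "c" is delivered but not yet saved
receiver.crash()
recovered = receiver.recover(sender)
```

In this toy run, the message sent after the last flush survives the receiver crash only because the sender still holds it, which is the role the abstract assigns to the temporary sender-based log.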

Elsevier - Publisher Connector

A method for analyzing the performance aspects of the fault-tolerance mechanisms in FDDI

Author: Haverkort Boudewijn R.
Moorsel Aad P.A. van
Niemegeers Ignas G.
Publication venue: IEEE
Publication date: 01/01/1992

The ability of error recovery mechanisms to make the Fiber Distributed Data Interface (FDDI) satisfy real-time performance constraints in the presence of errors is analyzed. A complicating factor in these analyses is the rarity of error occurrences, which makes direct simulation unattractive. Therefore, a fast simulation technique called injection simulation was developed, which makes it possible to analyze the performance of FDDI, including its fault tolerance behavior. The implementation of injection simulation for polling models of FDDI is discussed, along with simulation results.

University of Twente Research Information

On designing dependable services with diverse off-the-shelf SQL servers

Author: A. Avizienis
A. Vaysburd
B. Kemme
C. Babbage
D. Powell
F. Pedone
F. Schneider
I. Gashi
J. Gray
J.C. Laprie
M. Patino-Martinez
M. Weismann
P. Popov
P.A. Bernstein
P.E. Ammann
P.J. Traverse
P.M. Chen
R. Jimenez-Peris
S. Chandra
S. Poledna
T. Anderson
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2004

City Research Online

Crossref

Computing in the RAIN: a reliable array of independent nodes

Author: Bohossian Vasken
Bruck Jehoshua
Fan Chenggong C.
LeMahieu Paul S.
Riedel Marc D.
Xu Lihao
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/1998

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.

CiteSeerX

Caltech Authors