4 research outputs found

    Robust TCP Connections for Fault Tolerant Computing

    When processes on two different machines communicate, they most often do so using the TCP protocol. While TCP is appropriate for a wide range of applications, it has shortcomings in other application areas. One of these areas is fault tolerant distributed computing. For some of those applications, TCP does not address link failures adequately: TCP breaks the connection if connectivity is lost for some duration (typically minutes). This is sometimes undesirable. The paper proposes robust TCP connections, a solution to the problem of broken TCP connections. The paper presents a session layer protocol on top of TCP that ensures reconnection and provides exactly-once delivery for all transmitted data. A prototype has been implemented as a Java library. The prototype has less than 10% overhead compared with plain TCP sockets with respect to the most important performance figures.
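The abstract above describes reconnection combined with exactly-once delivery. One common way to get that property, shown here only as a minimal in-memory sketch, is to number every frame, keep unconfirmed frames in a resend buffer, and have the receiver drop anything it has already delivered. The class names `Sender` and `Receiver`, the frame layout, and the `on_reconnect` handshake are assumptions made for illustration; the listing does not describe the paper's actual protocol or its Java API.

```python
from collections import deque


class Sender:
    """Hypothetical sending side: numbers frames and keeps them until confirmed."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = deque()  # (seq, payload) frames not yet confirmed

    def send(self, payload):
        frame = (self.next_seq, payload)
        self.unacked.append(frame)
        self.next_seq += 1
        return frame

    def on_reconnect(self, last_delivered):
        # Discard frames the receiver reports as delivered, resend the rest.
        while self.unacked and self.unacked[0][0] <= last_delivered:
            self.unacked.popleft()
        return list(self.unacked)


class Receiver:
    """Hypothetical receiving side: delivers each sequence number once, in order."""

    def __init__(self):
        self.last_delivered = -1
        self.delivered = []

    def receive(self, frame):
        seq, payload = frame
        if seq != self.last_delivered + 1:
            return False  # duplicate or gap: ignore, a resend will fill gaps
        self.delivered.append(payload)
        self.last_delivered = seq
        return True


def demo():
    sender, receiver = Sender(), Receiver()
    receiver.receive(sender.send("a"))
    lost_b = sender.send("b")  # assume the link breaks before these arrive
    sender.send("c")
    # After the (simulated) reconnect, the receiver reports its position.
    for frame in sender.on_reconnect(receiver.last_delivered):
        receiver.receive(frame)
    receiver.receive(lost_b)  # a late duplicate is dropped
    return receiver.delivered
```

In this sketch the application sees `"a"`, `"b"`, `"c"` once each even though two frames were lost and one arrived twice.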

    Atomic broadcast: a fault-tolerant token based algorithm and performance evaluations

    Within only a couple of generations, the so-called digital revolution has taken the world by storm: today, almost all human beings interact, directly or indirectly, at some point in their life, with a computer system. Computers are present on our desks, computer systems control the antilock braking system and the stability control in cars, they collect usage statistics in elevators in order to anticipate maintenance and repair operations. Computer systems also operate critical systems, such as nuclear power plants, airplane control systems or space rockets. Furthermore, computer systems are not only omnipresent, but also increasingly networked. As the use of computer systems has increased dramatically over the past decades, the needs and expectations associated with these systems have also increased. In particular, one of the critical points of a system is its availability (the fraction of the time during which the system provides a service to the users): the costs and negative publicity of a system outage (be it a commercial web site or a stock exchange for example) are often considerable. Fault tolerance is one of the approaches to designing a highly-available system: a fault tolerant system is designed in such a way that the failure of one of the components of the system does not compromise the functionality of the system as a whole. Replication is one of the common fault tolerance techniques. Instead of having a single machine (a replica) providing a service, the system is composed of several replicas running the service and connected through a network. If one of the replicas fails, the service is still provided by the remaining replicas. The replication technique is interesting as it can be achieved by using software running on commodity hardware, thus avoiding the high cost of special purpose hardware. 
Replication, although intuitive to understand, is complex to implement in practice, as the replicas have to interact in order to ensure the consistency of the system as a whole. Group communication simplifies the replication problem by hiding issues such as the communication between the replicas, the crashes of one or several replicas and the synchronization of the replicas. In this thesis, we start by comparing two replication techniques – group communication and quorum systems – and identifying the cases in which each technique should be used. Atomic broadcast (a group communication primitive at the heart of this work) allows replicas to broadcast messages to each other and then deliver them in the same total order, even if replicas broadcast messages almost simultaneously. Atomic broadcast is especially useful for replication: since all replicas deliver messages in the same order, their state is kept consistent. After the comparison between the replication techniques, we present an atomic broadcast algorithm that is designed to perform well when the system is heavily loaded and that allows crashed replicas to be detected quickly (by minimizing the consequences of wrongly suspecting a non-crashed replica). The presentation of the algorithm includes simulation results comparing the performance of the new algorithm to previously proposed atomic broadcast algorithms. The second part of the thesis focuses on the experimental performance evaluation of the new algorithm in several settings. We start by comparing four atomic broadcast algorithms in a local area network. We then compare three of the four algorithms in a wide area network, with sites in Switzerland, Japan and France, and where the round trip time between the sites varies between 4 and 300 ms. Finally, we evaluate the impact of the size of the system (the number of replicas) on the performance of the algorithms.
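The total-order property described in the abstract above can be illustrated with a toy token scheme: a token carrying the next global sequence number circulates among the replicas, only the token holder stamps its pending messages, and every replica delivers strictly in sequence order. This is a simplified sketch under those assumptions, with no failure handling at all; it is not the thesis's algorithm, and the names `Replica` and `circulate` are hypothetical.

```python
class Replica:
    """Hypothetical replica that delivers messages in global sequence order."""

    def __init__(self, name):
        self.name = name
        self.pending = []         # messages this replica wants to broadcast
        self.buffer = {}          # seq -> message received but not yet delivered
        self.next_to_deliver = 0
        self.delivered = []

    def on_token(self, next_seq):
        # Only the token holder assigns sequence numbers.
        stamped = []
        for msg in self.pending:
            stamped.append((next_seq, msg))
            next_seq += 1
        self.pending = []
        return stamped, next_seq

    def on_message(self, seq, msg):
        # Buffer the message, then deliver every consecutive one available.
        self.buffer[seq] = msg
        while self.next_to_deliver in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_to_deliver))
            self.next_to_deliver += 1


def circulate(replicas, rounds=1):
    """Pass the token around the ring; each holder broadcasts its stamped messages."""
    next_seq = 0
    for _ in range(rounds):
        for holder in replicas:
            stamped, next_seq = holder.on_token(next_seq)
            for replica in replicas:
                # Reverse the arrival order to mimic network reordering.
                for seq, msg in reversed(stamped):
                    replica.on_message(seq, msg)


def demo():
    a, b, c = Replica("a"), Replica("b"), Replica("c")
    a.pending = ["a1", "a2"]
    b.pending = ["b1"]
    circulate([a, b, c])
    return [r.delivered for r in (a, b, c)]
```

Every replica in `demo()` ends up with the same delivery sequence, which is what keeps replicated state consistent.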

    Fault Detection Using Hints from the Socket Layer

    Fault detection in distributed systems is usually accomplished using a variation of the polling or watch-dog techniques. With these techniques, however, a tradeoff has to be made between the speed of fault detection and performance. In this paper we examine the effectiveness of a fault detection mechanism that uses the error codes returned by the socket functions to locate process failures. Since these errors are generated automatically when there is communication with the failed process, the mechanism does not incur any overhead. However, for certain types of faults, detection can only be attained if the surviving processes use a special pattern of communication. To assess the coverage and latency of the proposed mechanism, faults were injected during the execution of two parallel applications, a particle simulator and a distributed raytracer. Our results show that in most cases, faults could be found using only the errors from the socket layer. Depending on the type of fault that..
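The idea of treating socket-layer errors as failure hints can be sketched with Python's standard `socket` module. The wrapper name `recv_with_hint`, the set of errno values treated as failure evidence, and the `"EOF"` label are assumptions chosen for illustration; the abstract does not list which error codes the paper relies on. Note that a hint only appears when the surviving process actually touches the socket, which matches the abstract's remark about communication patterns.

```python
import errno
import socket

# Assumed choice of errno values that suggest the peer has failed.
FAILURE_ERRNOS = {
    errno.ECONNRESET,
    errno.EPIPE,
    errno.ECONNREFUSED,
    errno.ETIMEDOUT,
}


def recv_with_hint(sock, nbytes=4096):
    """Return (data, hint); hint stays None while the peer looks alive."""
    try:
        data = sock.recv(nbytes)
    except OSError as exc:
        if exc.errno in FAILURE_ERRNOS:
            # Report the symbolic errno name as the failure hint.
            return b"", errno.errorcode.get(exc.errno, "UNKNOWN")
        raise  # unrelated errors are not treated as failure evidence
    if data == b"":
        return b"", "EOF"  # the peer closed its end of the connection
    return data, None


def demo():
    a, b = socket.socketpair()
    b.sendall(b"ping")
    first = recv_with_hint(a)
    b.close()  # stands in for the peer process exiting
    second = recv_with_hint(a)
    a.close()
    return first, second
```

Because the hint is a by-product of ordinary `recv` calls, no extra polling traffic is needed in this sketch.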