Volume 71, Number 21, March 21, 1952
14th Turkish Symposium on Artificial Intelligence and Neural Networks -- JUN 16-17, 2005 -- Izmir, TURKEY
WOS: 000239585200025
Replication of data or processes is an effective way to provide enhanced performance, high availability and fault tolerance in distributed systems. For instance, in systems based on the client-server model, a server may serve many clients and, under heavy load, may be unable to respond to requests on time. In such a case, replicating data or servers may improve performance. Moreover, data and processes can be replicated to protect against failures. However, this is a very complex procedure. In this paper, I propose a replication-based method for making systems fault tolerant by exploiting collaborative agents. This method is also used to improve fault tolerance in multi-agent systems.
Izmir Inst Technol, EE & CE Depts, Turkish Sci & Res Council, Izmir Branch Chamber Elect & Elect Engineer
Improving the scalability of cloud-based resilient database servers
Many now rely on public cloud infrastructure-as-a-service for
database servers, mainly by pushing the limits of existing pooling and
replication software to operate large shared-nothing virtual server clusters.
Yet, it is unclear whether this is still the best architectural choice,
namely, when cloud infrastructure provides seamless virtual shared storage
and bills clients on actual disk usage.
This paper addresses this challenge with Resilient Asynchronous Commit
(RAsC), an improvement to a well-known shared-nothing design based
on the assumption that a much larger number of servers is required for
scale than for resilience. Then we compare this proposal to other database
server architectures using an analytical model focused on peak throughput
and conclude that it provides the best performance/cost trade-off while at
the same time addressing a wide range of fault scenarios.
Solving atomic multicast when groups crash
In this paper, we study the atomic multicast problem, a fundamental abstraction for building fault-tolerant systems. In the atomic multicast problem, the system is divided into non-empty and disjoint groups of processes. Multicast messages may be addressed to any subset of groups, each message possibly being multicast to a different subset. Several papers previously studied this problem either in local area networks [3, 9, 20] or wide area networks [13, 21]. However, none of them considered atomic multicast when groups may crash. We present two atomic multicast algorithms that tolerate the crash of groups. The first algorithm tolerates an arbitrary number of failures, is genuine (i.e., to deliver a message m, only addressees of m are involved in the protocol), and uses the perfect failure detector P. We show that among realistic failure detectors, i.e., those that do not predict the future, P is necessary to solve genuine atomic multicast if we do not bound the number of processes that may fail. Thus, P is the weakest realistic failure detector for solving genuine atomic multicast when an arbitrary number of processes may crash. Our second algorithm is non-genuine and less resilient to process failures than the first algorithm but has several advantages: (i) it requires perfect failure detection within groups only, and not across the system, (ii) as we show in the paper, it can be modified to rely on unreliable failure detection at the cost of a weaker liveness guarantee, and (iii) it is fast: messages addressed to multiple groups may be delivered within two inter-group message delays only.
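To make the notions of disjoint groups, per-message addressee sets, and genuineness concrete, here is a minimal failure-free sketch in Python. It is not either of the paper's algorithms: it uses a simplified Skeen-style timestamp agreement, processes multicasts one at a time, and ignores crashes and failure detectors entirely. The names `Process`, `multicast`, and `delivery_order` are hypothetical and chosen for illustration only.

```python
class Process:
    """A process with a logical clock (illustrative, failure-free)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.pending = {}  # msg_id -> agreed final timestamp

    def propose(self):
        # Each addressee proposes a timestamp larger than any it has seen.
        self.clock += 1
        return self.clock

    def finalize(self, msg_id, final_ts):
        # Adopt the agreed timestamp and advance the local clock past it.
        self.clock = max(self.clock, final_ts)
        self.pending[msg_id] = final_ts

    def delivery_order(self):
        # Deliver by (final timestamp, message id); the id breaks ties.
        return [m for m, _ in sorted(self.pending.items(),
                                     key=lambda kv: (kv[1], kv[0]))]


def multicast(msg_id, dest_groups, groups):
    """Multicast msg_id to the groups named in dest_groups.

    Only members of the destination groups take part, which is the
    'genuine' property described in the abstract.
    """
    addressees = [p for g in dest_groups for p in groups[g]]
    final_ts = max(p.propose() for p in addressees)
    for p in addressees:
        p.finalize(msg_id, final_ts)
    return addressees


# Three disjoint groups; each message goes to a different subset of groups.
a, b, c, d = (Process(n) for n in "abcd")
groups = {"g1": [a, b], "g2": [c], "g3": [d]}
multicast("m1", ["g1", "g2"], groups)
multicast("m2", ["g2", "g3"], groups)
multicast("m3", ["g1", "g3"], groups)
```

In this run, any two processes that both receive a pair of messages order them the same way, and a process outside a message's destination groups never sees that message.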
An Architecture for Self-healing Autonomous Object Groups
Jgroup/ARM is a middleware for developing and operating dependable distributed Java applications. Jgroup integrates the distributed object model of Java RMI with the object group paradigm, enabling construction of replicated servers that offer dependable services to clients. ARM aims to improve the dependability characteristics of systems through fault treatment, focusing on operational aspects where the gain in terms of improved dependability is likely to be the greatest. ARM offers two core mechanisms: recovery from node, object and network failures and distribution of replicas. ARM identifies failures and reconfigures the system according to its dependability requirements. This paper proposes an enhancement of the ARM framework in which replica placement is performed in a distributed manner, eliminating the need for a centralized manager with global information about all object groups. Instead, each autonomous object group handles its own replica placement based on information from nodes. Assuming that multiple object groups are deployed in the system, this constitutes a distributed replica placement scheme. This scheme enables the implementation of self-healing object groups that can perform fault treatment on themselves. Advantages of the approach: (a) no need to maintain global information about all object groups, which is costly and limits scalability, (b) reduced infrastructure complexity, and (c) less communication overhead.
Aquarius: A Data-Centric approach to CORBA Fault-Tolerance
The Internet provides abundant opportunity to share resources, and form commerce and business relationships. Key to sharing information and performing collaborative tasks are tools that meet client demands for reliability, high availability, and responsiveness. Many techniques for high availability and for load balancing were developed aiming at small t
Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs
Failure detection and group membership are two important components of fault-tolerant distributed systems. Understanding their role is essential when developing efficient solutions, not only in failure-free runs, but also in runs in which processes do crash. While group membership provides consistent information about the status of processes in the system, failure detectors provide inconsistent information. This paper discusses the trade-offs related to the use of these two components, and clarifies their roles using three examples. The first example shows a case where group membership may favourably be replaced by a failure detection mechanism. The second example illustrates a case where group membership is mandatory. Finally, the third example shows a case where neither group membership nor failure detectors are needed (they may be replaced by weak ordering oracles).
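The claim that failure detectors provide inconsistent information can be illustrated with a small Python sketch. It assumes a simple timeout-based heartbeat detector, which is a common construction but is not taken from the paper. The class name `HeartbeatDetector`, the timeout value, and the timestamps are all hypothetical.

```python
class HeartbeatDetector:
    """Timeout-based failure detector local to one observer (illustrative)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process name -> time of last heartbeat received

    def heartbeat(self, process, now):
        # Record that a heartbeat from `process` arrived at time `now`.
        self.last_seen[process] = now

    def suspects(self, now):
        # Suspect every process whose last heartbeat is older than the timeout.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


# Two observers monitor the same process "q" with the same timeout.
obs1 = HeartbeatDetector(timeout=3)
obs2 = HeartbeatDetector(timeout=3)

obs1.heartbeat("q", now=0)
obs2.heartbeat("q", now=0)

# The second heartbeat reaches obs1 but is delayed or lost on the way to obs2.
obs1.heartbeat("q", now=4)

view1 = obs1.suspects(now=5)  # obs1 does not suspect q
view2 = obs2.suspects(now=5)  # obs2 suspects q
```

Here the two observers disagree about `q` at the same instant, even though `q` has not crashed. A group membership service, by contrast, would have all members agree on the same sequence of views, at the cost of running an agreement protocol.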