23 research outputs found

    Volume 71, Number 21, March 21, 1952

    14th Turkish Symposium on Artificial Intelligence and Neural Networks -- JUN 16-17, 2005 -- Izmir, TURKEY
    WOS: 000239585200025
    Replication of data or processes is an effective way to provide enhanced performance, high availability and fault tolerance in distributed systems. For instance, in systems based on the client-server model, a server may serve many clients and, under heavy load, fail to respond to requests on time. In such a case, replicating data or servers may improve performance. Moreover, data and processes can be replicated to protect against failures. However, this is a very complex procedure. In this paper, I propose a replication-based method for making systems fault tolerant that exploits collaborative agents. This method is also used to improve fault tolerance in multi-agent systems.
    Izmir Inst Technol, EE & CE Depts, Turkish Sci & Res Council, Izmir Branch Chamber Elect & Elect Engineer

    Improving the scalability of cloud-based resilient database servers

    Many now rely on public cloud infrastructure-as-a-service for database servers, mainly by pushing the limits of existing pooling and replication software to operate large shared-nothing virtual server clusters. Yet it is unclear whether this is still the best architectural choice, namely when cloud infrastructure provides seamless virtual shared storage and bills clients on actual disk usage. This paper addresses this challenge with Resilient Asynchronous Commit (RAsC), an improvement to a well-known shared-nothing design based on the assumption that a much larger number of servers is required for scale than for resilience. We then compare this proposal to other database server architectures using an analytical model focused on peak throughput and conclude that it provides the best performance/cost trade-off while also addressing a wide range of fault scenarios.

    Solving atomic multicast when groups crash

    In this paper, we study the atomic multicast problem, a fundamental abstraction for building fault-tolerant systems. In the atomic multicast problem, the system is divided into non-empty and disjoint groups of processes. Multicast messages may be addressed to any subset of groups, each message possibly being multicast to a different subset. Several papers previously studied this problem either in local area networks [3, 9, 20] or wide area networks [13, 21]. However, none of them considered atomic multicast when groups may crash. We present two atomic multicast algorithms that tolerate the crash of groups. The first algorithm tolerates an arbitrary number of failures, is genuine (i.e., to deliver a message m, only addressees of m are involved in the protocol), and uses the perfect failure detector P. We show that among realistic failure detectors, i.e., those that do not predict the future, P is necessary to solve genuine atomic multicast if we do not bound the number of processes that may fail. Thus, P is the weakest realistic failure detector for solving genuine atomic multicast when an arbitrary number of processes may crash. Our second algorithm is non-genuine and less resilient to process failures than the first algorithm but has several advantages: (i) it requires perfect failure detection within groups only, and not across the system; (ii) as we show in the paper, it can be modified to rely on unreliable failure detection at the cost of a weaker liveness guarantee; and (iii) it is fast: messages addressed to multiple groups may be delivered within two inter-group message delays only.
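The problem setting in the abstract above (disjoint, non-empty groups; messages addressed to subsets of groups) can be illustrated with a minimal sketch. This is not either of the paper's algorithms. It only models the system and checks a simplified, pairwise version of the ordering requirement, which is weaker than the acyclic order usually required. All names (`disjoint_nonempty`, `addressees`, `pairwise_order_ok`) are hypothetical and chosen for illustration.

```python
from itertools import combinations


def disjoint_nonempty(groups):
    """Return True if every group is non-empty and no process is in two groups."""
    seen = set()
    for members in groups.values():
        if not members or seen & members:
            return False
        seen |= members
    return True


def addressees(dest_groups, groups):
    """Processes addressed by a message multicast to the named groups."""
    return set().union(*(groups[g] for g in dest_groups))


def pairwise_order_ok(deliveries):
    """Simplified order check (an assumption of this sketch): any two
    processes deliver the messages they have in common in the same
    relative order. It does not detect cycles spanning three or more
    processes."""
    for p, q in combinations(deliveries, 2):
        common = set(deliveries[p]) & set(deliveries[q])
        seq_p = [m for m in deliveries[p] if m in common]
        seq_q = [m for m in deliveries[q] if m in common]
        if seq_p != seq_q:
            return False
    return True
```

In a genuine protocol, only the processes returned by `addressees` for a message would take part in delivering it. The sketch makes that set explicit but does not enforce it.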

    TCP-ABC: From Multiple TCP Connections to Atomic Broadcasting

    No full text

    An Architecture for Self-healing Autonomous Object Groups

    Jgroup/ARM is a middleware for developing and operating dependable distributed Java applications. Jgroup integrates the distributed object model of Java RMI with the object group paradigm, enabling construction of replicated servers that offer dependable services to clients. ARM aims to improve the dependability characteristics of systems through fault treatment, focusing on operational aspects where the gain in terms of improved dependability is likely to be the greatest. ARM offers two core mechanisms: recovery from node, object and network failures, and distribution of replicas. ARM identifies failures and reconfigures the system according to its dependability requirements. This paper proposes an enhancement of the ARM framework in which replica placement is performed in a distributed manner, eliminating the need for a centralized manager with global information about all object groups. Instead, each autonomous object group handles its own replica placement based on information from nodes. Assuming that multiple object groups are deployed in the system, this constitutes a distributed replica placement scheme. This scheme enables the implementation of self-healing object groups that can perform fault treatment on themselves. Advantages of the approach: (a) no need to maintain global information about all object groups, which is costly and limits scalability, (b) reduced infrastructure complexity, and (c) less communication overhead.

    A Fast Group Communication Mechanism for Large Scale Distributed Objects

    No full text

    Aquarius: A Data-Centric approach to CORBA Fault-Tolerance

    No full text
    The Internet provides abundant opportunity to share resources, and form commerce and business relationships. Key to sharing information and performing collaborative tasks are tools that meet client demands for reliability, high availability, and responsiveness. Many techniques for high availability and for load balancing were developed aiming at small t

    Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs

    No full text
    Failure detection and group membership are two important components of fault-tolerant distributed systems. Understanding their role is essential when developing efficient solutions, not only in failure-free runs, but also in runs in which processes do crash. While group membership provides consistent information about the status of processes in the system, failure detectors provide inconsistent information. This paper discusses the trade-offs related to the use of these two components, and clarifies their roles using three examples. The first example shows a case where group membership may favourably be replaced by a failure detection mechanism. The second example illustrates a case where group membership is mandatory. Finally, the third example shows a case where neither group membership nor failure detectors are needed (they may be replaced by weak ordering oracles).
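The remark that failure detectors provide inconsistent information can be illustrated with a minimal sketch of a timeout-based heartbeat detector. This is an assumed, generic design, not a mechanism taken from the paper, and the class name `HeartbeatDetector` and its methods are hypothetical. Each observer keeps its own detector, so two observers can hold different suspicion sets at the same instant, whereas a group membership service would give all members the same agreed view.

```python
class HeartbeatDetector:
    """Local, unreliable failure detector based on heartbeat timeouts."""

    def __init__(self, processes, timeout):
        self.timeout = timeout
        # Time of the last heartbeat seen from each monitored process.
        self.last_seen = {p: 0.0 for p in processes}

    def heartbeat(self, process, now):
        """Record that a heartbeat from `process` arrived at time `now`."""
        self.last_seen[process] = now

    def suspects(self, now):
        """Processes whose last heartbeat is older than the timeout."""
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


# Two observers monitor the same processes, but the heartbeat from p2
# reaches only observer a in time (for example, a slow link to b).
a = HeartbeatDetector(["p1", "p2"], timeout=2.0)
b = HeartbeatDetector(["p1", "p2"], timeout=2.0)
a.heartbeat("p1", 4.0)
a.heartbeat("p2", 4.0)
b.heartbeat("p1", 4.0)
```

At time 5.0 in this scenario, observer `a` suspects no one while observer `b` suspects `p2`, even though `p2` has not crashed.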